- Console
- Change Kernel…
- Clear Console Cells
- Close and Shut Down…
- Insert Line Break
- Interrupt Kernel
- New Console
- Restart Kernel…
- Run Cell (forced)
- Run Cell (unforced)
- Show All Kernel Activity
- Extension Manager
- Enable Extension Manager (experimental)
- File Operations
- Autosave Documents
- Open from Path…Open from path
- Reload Notebook from DiskReload contents from disk
- Revert Notebook to CheckpointRevert contents to previous checkpoint
- Save NotebookSave and create checkpointCtrl+S
- Save Notebook As…Save with new pathCtrl+Shift+S
- Show Active File in File Browser
- Trust HTML File
- Help
- Jupyter Reference
- JupyterLab FAQ
- JupyterLab Reference
- Launch Classic Notebook
- Markdown Reference
- Reset Application State
- Image Viewer
- Flip image horizontallyH
- Flip image verticallyV
- Invert ColorsI
- Reset Image0
- Rotate Clockwise]
- Rotate Counterclockwise[
- Zoom In=
- Zoom Out-
- Kernel Operations
- Shut Down All Kernels…
- Launcher
- New Launcher
- Main Area
- Activate Next TabCtrl+Shift+]
- Activate Previous TabCtrl+Shift+[
- Activate Previously Used TabCtrl+Shift+'
- Close All Other Tabs
- Close All Tabs
- Close TabAlt+W
- Close Tabs to Right
- Find NextCtrl+G
- Find PreviousCtrl+Shift+G
- Find…Ctrl+F
- Log OutLog out of JupyterLab
- Presentation Mode
- Show Left SidebarCtrl+B
- Show Log Console
- Show Status Bar
- Shut DownShut down JupyterLab
- Single-Document ModeCtrl+Shift+D
- Notebook Cell Operations
- Change to Code Cell TypeY
- Change to Heading 11
- Change to Heading 22
- Change to Heading 33
- Change to Heading 44
- Change to Heading 55
- Change to Heading 66
- Change to Markdown Cell TypeM
- Change to Raw Cell TypeR
- Clear Outputs
- Collapse All Code
- Collapse All Outputs
- Collapse Selected Code
- Collapse Selected Outputs
- Copy CellsC
- Cut CellsX
- Delete CellsD, D
- Disable Scrolling for Outputs
- Enable Scrolling for Outputs
- Expand All Code
- Expand All Outputs
- Expand Selected Code
- Expand Selected Outputs
- Extend Selection AboveShift+K
- Extend Selection BelowShift+J
- Extend Selection to BottomShift+End
- Extend Selection to TopShift+Home
- Insert Cell AboveA
- Insert Cell BelowB
- Merge Selected CellsShift+M
- Move Cells Down
- Move Cells Up
- Paste Cells Above
- Paste Cells and Replace
- Paste Cells BelowV
- Redo Cell OperationShift+Z
- Run Selected CellsShift+Enter
- Run Selected Cells and Don't AdvanceCtrl+Enter
- Run Selected Cells and Insert BelowAlt+Enter
- Run Selected Text or Current Line in Console
- Select Cell AboveK
- Select Cell BelowJ
- Split CellCtrl+Shift+-
- Undo Cell OperationZ
- Notebook Operations
- Change Kernel…
- Clear All Outputs
- Close and Shut Down
- Deselect All Cells
- Enter Command ModeCtrl+M
- Enter Edit ModeEnter
- Export Notebook to Asciidoc
- Export Notebook to Executable Script
- Export Notebook to HTML
- Export Notebook to Html_ch
- Export Notebook to Html_embed
- Export Notebook to Html_toc
- Export Notebook to Html_with_lenvs
- Export Notebook to Html_with_toclenvs
- Export Notebook to LaTeX
- Export Notebook to Latex_with_lenvs
- Export Notebook to Markdown
- Export Notebook to PDF
- Export Notebook to ReStructured Text
- Export Notebook to Reveal.js Slides
- Export Notebook to SelectLanguage
- Export Notebook to Slides_with_lenvs
- Interrupt Kernel
- New Console for Notebook
- New NotebookCreate a new notebook
- Reconnect To Kernel
- Render All Markdown Cells
- Restart Kernel and Clear All Outputs…
- Restart Kernel and Run All Cells…
- Restart Kernel…
- Run All Above Selected Cell
- Run All Cells
- Run Selected Cell and All Below
- Select All CellsCtrl+A
- Toggle All Line NumbersShift+L
- Toggle Recording Cell Timing
- Trust Notebook
- Settings
- Advanced Settings EditorCtrl+,
- Show Contextual Help
- Show Contextual HelpLive updating code documentation from the active kernelCtrl+I
- Terminal
- Decrease Terminal Font Size
- Increase Terminal Font Size
- New TerminalStart a new terminal session
- Refresh TerminalRefresh the current terminal session
- Use Dark Terminal ThemeSet the terminal theme
- Use Inherit Terminal ThemeSet the terminal theme
- Use Light Terminal ThemeSet the terminal theme
- Text Editor
- Decrease Font Size
- Increase Font Size
- Indent with Tab
- New Markdown FileCreate a new markdown file
- New Text FileCreate a new text file
- Spaces: 1
- Spaces: 2
- Spaces: 4
- Spaces: 8
- Theme
- Decrease Code Font Size
- Decrease Content Font Size
- Decrease UI Font Size
- Increase Code Font Size
- Increase Content Font Size
- Increase UI Font Size
- Theme Scrollbars
- Use JupyterLab Dark Theme
- Use JupyterLab Light Theme
- Variable Inspector
- Open Variable Inspector
- downloads10 months ago
- MaxQuant2 days ago
- metadata2 days ago
- outputs10 months ago
- 200826_expt296_p3825.ipynb9 months ago
- e296_Sequence-Agnostic RNP Purification_redact - Copy.ipynb33 minutes ago
- e296_Sequence-Agnostic RNP Purification_redact.ipynb33 minutes ago
- e296_Sequence-Agnostic RNP Purification.ipynb7 months ago
Terminal Sessions
Kernel Sessions
- 200906.ipynb
- e296_Sequence-Agnostic RNP Purification_redact.ipynb
- e296_Sequence-Agnostic RNP Purification_redact - Copy.ipynb
- e296_Sequence-Agnostic RNP Purification_redact.ipynb
- e296_Sequence-Agnostic RNP Purification_redact - Copy.ipynb
- JupyterLab extension for running LaTeX
- Table of Contents extension for JupyterLab
- A JupyterLab extension.
- A JupyterLab extension for showing Notebook diffs.
- 200826 Expt.296: Sequence-Agnostic RNP Purificationplaceholder
- 1. Import Modules and Files
- 2. Metadata Creation
- 3. Re-Annotate the MaxQuant pGroups with stable Gene IDs
- 4. Review Contaminants by Sample
- 5. Assess Digestion Efficiencyplaceholder
- Results: Average
- 6. Remove Contaminants
- 7. Drop Gene Duplicates and Filter Intensities by LFQ
- 8. Review Sample Clustering by Group
- 9. Analyse Normalisation Effects by Sample
- 10. Compare Intensity Distribution and Sequence Coverage
- 11. Compare Sum Peptide Countsplaceholder
- 12. Compare Unique Gene Counts
- 13. Assess Replicate Correlation
- 14. Classify RBP
- 15. Compare RBP Identity Between Conditions B3 and B4 Treatments
- 16. Review Basic Gene Ontologyplaceholder
- Result: GOOD
- 17. Retrieve GO Records
xxxxxxxxxx# 200826 Expt.296: Sequence-Agnostic RNP Purification xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx# 200826 Expt.296: Sequence-Agnostic RNP Purification _NB: This demonstration analysis was performed on unpublished development data. The original experiment was designed to investigate the conditions under which RBP-RNA complexes could be purified without hybridisation. The data shown here is only one part of 4 datasets in the experiment (including the protocol which worked)_ __Aims:__ A. To investigate different buffer compositions and elution conditions that may reveal a capacity for the purification of RBP-RNA complexes via selective precipitation.__Conditions__ Buffer 3 Elution 3 (b3e3) Buffer 3 Elution 4 (b3e4) Buffer 4 Elution 3 (b4e3) Buffer 4 Elution 4 (b4e4) 200826 Expt.296: Sequence-Agnostic RNP Purification
NB: This demonstration analysis was performed on unpublished development data. The original experiment was designed to investigate the conditions under which RBP-RNA complexes could be purified without hybridisation. The data shown here is only one part of 4 datasets in the experiment (including the protocol which worked)
Aims:
A. To investigate different buffer compositions and elution conditions that may reveal a capacity for the purification of RBP-RNA complexes via selective precipitation.
Conditions
Buffer 3 Elution 3 (b3e3)
Buffer 3 Elution 4 (b3e4)
Buffer 4 Elution 3 (b4e3)
Buffer 4 Elution 4 (b4e4)
xxxxxxxxxx1. Import Modules and Files
Custom Functions
jwrangle.importMixedFiles( )
I generally import everything I MIGHT use at the start and set up pathing using the OS-agnostic pathlib.
#### Import necessary modulesimport osimport pandas as pdfrom pathlib import Pathimport copyimport jwrangleimport jvisimport jinspectimport jtestimport jwebfrom imp import reloadimport upsetplot as upsetfrom Bio import SeqIOimport altair as altimport numpy as npimport preprocessimport seaborn as sns#### define working directoriescwd = Path(os.getcwd())base_path = Path(os.path.join(*cwd.parts[:cwd.parts.index('experiments')]))#### MaxQuant proteinGroups & evidence filesMQ_folder = jwrangle.importMixedFiles(cwd / 'MaxQuant')MQ_folder.keys()pGroups = MQ_folder['proteinGroups.txt']evidence = MQ_folder['evidence.txt']#### Inspect MQ setupMQ_folder['parameters.txt'].head(9)C:\Users\smith.j\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3242: DtypeWarning: Columns (53,54,61) have mixed types.Specify dtype option on import or set low_memory=False. if (await self.run_code(code, result, async_=asy)): C:\Users\smith.j\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3242: DtypeWarning: Columns (321,322) have mixed types.Specify dtype option on import or set low_memory=False. if (await self.run_code(code, result, async_=asy)):
| Parameter | Value | |
|---|---|---|
| 0 | Version | 1.6.7.0 |
| 1 | User name | smith.j |
| 2 | Machine name | MSCYPHER-253 |
| 3 | Date of writing | 08/26/2020 20:00:00 |
| 4 | Include contaminants | True |
| 5 | PSM FDR | 0.01 |
| 6 | PSM FDR Crosslink | 0.01 |
| 7 | Protein FDR | 0.01 |
| 8 | Site FDR | 0.01 |
xxxxxxxxxx2. Metadata Creation
Custom Functions
jwrangle.MQ_writeMetadata( )
Metadata tabulates the test conditions for ALL experiments that shared the same MQ search and thuss all experiments that comprise the MQ outputs. Metadata can also be done in a spreadsheet program. Here, I have created my metadata programmatically instead (I simply find this easier).
The metadata table gives users the opportunity to rename samples and define the experimental parameters for the data. This task can be expecially complex for MaxQuant because a unified output is generated even if distinctly separate experiments are searched as a batch and with different parameters applied.
The function jwrangle.MQ_writeMetadata( ) will take a metadata table, rename all samples in the proteinGroups and evidence files, assign alternative filenames, and save new copies to be used in future analyses.
#### Inspect column namescolnames = list(pGroups.columns.values)#### Derive experiment names as a list#### I gave the original samples some irritating names that I'd like to correctexperiment_names = []for i in colnames: if 'Intensity ' in i: experiment_names.append(i.replace('Intensity ', ''))new_expt_names = []for i in experiment_names: i = i.replace('pH8S', 'pH7-S') i = i.replace('ne', 'ne-') i = i.replace('NC', 'nCL') i = i.replace('PC', '254') i = i.replace('heat', '-heat') i = i.replace('obd', '-obd') new_expt_names.append(i) #### Create a list of associated conditions#### These samples were mislabeled 'NC' and 'PC' (the latter is often short for PAR-CL, but these samples are 254nm crosslinked)condition = ['b3e3_nCL']*4 + ['b3e3_254']*4 + \ ['b4e3_nCL']*4 + ['b4e3_254']*4 + \ ['b3e4_nCL']*4 + ['b3e4_254']*4 + \ ['b4e4_nCL']*4 + ['b4e4_254']*4#### Create a list of associated replicate identifiersreplicate = []for i in new_expt_names: replicate.append(i.split('-')[-1])#### Create a more reader friendly list of sample namessamples = []for pos, label in enumerate(condition): samples.append(label+new_expt_names[pos][-2:]) #### The MQ_groups dictionary assigns each condition to its MQ group parameterMQ_groups = {'b3e3':['b3e3_nCL','b3e3_254'], 'b4e3':['b4e3_nCL','b4e3_254'], 'b3e4':['b3e4_nCL','b3e4_254'], 'b4e4':['b4e4_nCL','b4e4_254']}#### We then use the condition list to create an MQ_groups listGrp_parameters = []for label in condition: for key, value in MQ_groups.items(): if label in value: Grp_parameters.append(key)exp_df = pd.DataFrame( {'experiment': experiment_names, 'condition': condition, 'replicate': replicate, 'sample':samples, 'measure':['Intensity']*len(samples), # adding this column allows our metadata file to be compatible with Proteus 'MQgroups':Grp_parameters })exp_df| experiment | condition | replicate | sample | measure | MQgroups | |
|---|---|---|---|---|---|---|
| 0 | 17_pH8Silane20heat_NC-A | b3e3_nCL | A | b3e3_nCL-A | Intensity | b3e3 |
| 1 | 18_pH8Silane20heat_NC-B | b3e3_nCL | B | b3e3_nCL-B | Intensity | b3e3 |
| 2 | 19_pH8Silane20heat_NC-C | b3e3_nCL | C | b3e3_nCL-C | Intensity | b3e3 |
| 3 | 20_pH8Silane20heat_NC-D | b3e3_nCL | D | b3e3_nCL-D | Intensity | b3e3 |
| 4 | 21_pH8Silane20heat_PC-A | b3e3_254 | A | b3e3_254-A | Intensity | b3e3 |
| 5 | 22_pH8Silane20heat_PC-B | b3e3_254 | B | b3e3_254-B | Intensity | b3e3 |
| 6 | 23_pH8Silane20heat_PC-C | b3e3_254 | C | b3e3_254-C | Intensity | b3e3 |
| 7 | 24_pH8Silane20heat_PC-D | b3e3_254 | D | b3e3_254-D | Intensity | b3e3 |
| 8 | 25_pH8Silane30heat_NC-A | b4e3_nCL | A | b4e3_nCL-A | Intensity | b4e3 |
| 9 | 26_pH8Silane30heat_NC-B | b4e3_nCL | B | b4e3_nCL-B | Intensity | b4e3 |
| 10 | 27_pH8Silane30heat_NC-C | b4e3_nCL | C | b4e3_nCL-C | Intensity | b4e3 |
| 11 | 28_pH8Silane30heat_NC-D | b4e3_nCL | D | b4e3_nCL-D | Intensity | b4e3 |
| 12 | 29_pH8Silane30heat_PC-A | b4e3_254 | A | b4e3_254-A | Intensity | b4e3 |
| 13 | 30_pH8Silane30heat_PC-B | b4e3_254 | B | b4e3_254-B | Intensity | b4e3 |
| 14 | 31_pH8Silane30heat_PC-C | b4e3_254 | C | b4e3_254-C | Intensity | b4e3 |
| 15 | 32_pH8Silane30heat_PC-D | b4e3_254 | D | b4e3_254-D | Intensity | b4e3 |
| 16 | 33_pH8Silane20obd_NC-A | b3e4_nCL | A | b3e4_nCL-A | Intensity | b3e4 |
| 17 | 34_pH8Silane20obd_NC-B | b3e4_nCL | B | b3e4_nCL-B | Intensity | b3e4 |
| 18 | 35_pH8Silane20obd_NC-C | b3e4_nCL | C | b3e4_nCL-C | Intensity | b3e4 |
| 19 | 36_pH8Silane20obd_NC-D | b3e4_nCL | D | b3e4_nCL-D | Intensity | b3e4 |
| 20 | 37_pH8Silane20obd_PC-A | b3e4_254 | A | b3e4_254-A | Intensity | b3e4 |
| 21 | 38_pH8Silane20obd_PC-B | b3e4_254 | B | b3e4_254-B | Intensity | b3e4 |
| 22 | 39_pH8Silane20obd_PC-C | b3e4_254 | C | b3e4_254-C | Intensity | b3e4 |
| 23 | 40_pH8Silane20obd_PC-D | b3e4_254 | D | b3e4_254-D | Intensity | b3e4 |
| 24 | 41_pH8Silane30obd_NC-A | b4e4_nCL | A | b4e4_nCL-A | Intensity | b4e4 |
| 25 | 42_pH8Silane30obd_NC-B | b4e4_nCL | B | b4e4_nCL-B | Intensity | b4e4 |
| 26 | 43_pH8Silane30obd_NC-C | b4e4_nCL | C | b4e4_nCL-C | Intensity | b4e4 |
| 27 | 44_pH8Silane30obd_NC-D | b4e4_nCL | D | b4e4_nCL-D | Intensity | b4e4 |
| 28 | 46_pH8Silane30obd_PC-B | b4e4_254 | B | b4e4_254-B | Intensity | b4e4 |
| 29 | 46_pH8Silane30obd_PC-F | b4e4_254 | F | b4e4_254-F | Intensity | b4e4 |
| 30 | 47_pH8Silane30obd_PC-C | b4e4_254 | C | b4e4_254-C | Intensity | b4e4 |
| 31 | 48_pH8Silane30obd_PC-D | b4e4_254 | D | b4e4_254-D | Intensity | b4e4 |
MQ_expt296 = jwrangle.MQ_writeMetadata(pGroups, evidence, exp_df, 'e296r', cwd)metadata.csv created in metadata folder e296r_proteinGroups_metalabeled.txt created in MaxQuant folder e296r_evidence_metalabeled.txt created in MaxQuant folder
xxxxxxxxxx3. Re-Annotate the MaxQuant pGroups with stable Gene IDs
Functions
jweb.mapAnyID( )
jwrangle.importMixedFiles( )
MaxQuant does a good job of assigning a Gene name to each protein group. Presumably these gene names come from the FASTA. However:
- Sometimes it fails to find a gene name
- Sometimes it will assign an ID that is not a gene or include out-of-place characters
- It doesn't always seem to be consistent
- If the gene name originates from the FASTA then repeating the MQ search with an updated FASTA is the only way to update the gene IDs.
- Use of a mapping service will standardise the ID conversion practices between my datasets and those of others, including RNA-Seq.
To avoid these problems we will remap the Majority protein IDs to ENTREZ gene IDs. jweb.mapAnyID( ) will retrieve all possible genes for each protein group, and will also select a primary ID to singularly represent the group by a consistent method. This is a very flexible function, see help( ) for further explanation. From this point, the MQ 'Gene names' column will no longer be necessary. This function can also handle ID mapping to and from almost any convention.
Ensuring our proteins have a consistent gene naming strategy is essential for inter-experiment comparison and the later use of set methods. It also creates a standard that can be applied for accurately mapping RNA-Seq results and thus aid in future mapping of protein-RNA partners.
xxxxxxxxxx#### If not already loaded, read in the metadata-adjusted filesmetadata = pd.read_csv(cwd / 'metadata' / 'e296r_metadata.csv', index_col = 0)pGroups = pd.read_csv(cwd / 'MaxQuant' / 'e296r_proteinGroups_metalabeled.txt', delimiter = '\t')evidence = pd.read_csv(cwd / 'MaxQuant' / 'e296r_evidence_metalabeled.txt', delimiter = '\t')C:\Users\smith.j\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3051: DtypeWarning: Columns (321,322) have mixed types.Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result) C:\Users\smith.j\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3051: DtypeWarning: Columns (53,54,61) have mixed types.Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)
xxxxxxxxxx#### Dynamically remap gene names in our proteinGroups file and save a copypGroups_remap = jweb.mapAnyID_gPro(pGroups['Majority protein IDs'].tolist(), splitstr = [';', '-'], geneProductType = 'protein', gConvertOrganism = 'hsapiens', gConvertTarget = 'ENTREZGENE', writetopath = [cwd, 'pGroups_remap'], writeTargetsAsList = 'NO')xxxxxxxxxx#### If not already loaded, read in the remapped proteinGroups filepGroups_remap = jwrangle.importMixedFiles(cwd / 'downloads' / 'pGroups_remap', dropSuffix = 'yes')pGroups_remap.keys()dict_keys(['id_map', 'query_map'])
xxxxxxxxxx#### jwrangle.importMixedFiles() returns a dictionary where keys = files. We want the 'id_map' table created by jweb.mapAnyID_gPro().#### We'll rename the Query column and drop duplicates so the table can be merged with our proteinGroups table.id_map = pGroups_remap['id_map'].rename(columns={'Query': 'Majority protein IDs'}).drop_duplicates()#### Now use merge to add these new columns to our proteinGroups tablepGroups_map = pd.merge(pGroups, id_map, on='Majority protein IDs', how='left')#### Check the tables are merged by viewing column elements from each.pGroups_map[id_map.columns.tolist() + ['Peptide IDs']].head(2)| Majority protein IDs | ENTREZGENE_gPro all | ENTREZGENE_gPro primary | ENTREZGENE_gPro name | UNIPROT_gPro status | Peptide IDs | |
|---|---|---|---|---|---|---|
| 0 | A0A024R4E5;Q00341;Q00341-2;H0Y394 | HDLBP;nan | HDLBP | high density lipoprotein binding protein [Sour... | SWISSPROT | 57;711;1649;1671;1940;2102;2775;2824;3106;3440... |
| 1 | A0A024R4M0;P46781;B5MCT8;C9JM19 | RPS9 | RPS9 | ribosomal protein S9 [Source:HGNC Symbol;Acc:H... | SWISSPROT | 6085;6086;10517;11131;11450;13657;14430;14702;... |
xxxxxxxxxx4. Review Contaminants by Sample
Functions
jinspect.MQ_getContaminants( )
MQ_getContaminants_sbplot( )
jwrangle.importMixedFiles( )
We can extract the conaminants from our proteinGroups file using jinspect.MQ_getContaminants( ). These extracted table will return log2(iBAQ values).
Contaminants can then be reviewed with MQ_getContaminants_sbplot( ).
#### Extract contaminantscontaminants = jinspect.MQ_getContaminants(pGroups_map, metadata)#### Sort the metadata into a more intuitive ordermetadata_sort = metadata.sort_values(by = ['MQgroups','condition', 'replicate'], ascending=[True, False, True])#### Visually inspect contaminants jvis.MQ_getContaminants_sbplot(contaminants, metadata_sort, width = 3, length = 2, layout = 'single')<Figure size 432x288 with 0 Axes>
xxxxxxxxxx## 5. Assess Digestion Efficiency __Functions__ jinspect.MQ_getMissedCleavages( ) jvis.CommonPalettesAsHex jvis.BarPlotByGroup_sbplot( ) Assessing missed cleavages is an essential metric for understanding the quality of the tryptic digestion. This data is recorded in the evidence file. __jinspect.MQ_getMissedCleavages( )___ will return a long form data table that can easily be used for plotting. The dictionary __jvis.CommonPalettesAsHex__ contains a number of palettes that are common to both matplotlib and ggplot (from R). These are provided to ensure consistency is easy to achieve across both languages. We'll plot the missed cleavages with the generic function __jvis.BarPlotByGroup_sbplot( )__ 5. Assess Digestion Efficiency
Functions
jinspect.MQ_getMissedCleavages( )
jvis.CommonPalettesAsHex
jvis.BarPlotByGroup_sbplot( )
Assessing missed cleavages is an essential metric for understanding the quality of the tryptic digestion. This data is recorded in the evidence file.
jinspect.MQ_getMissedCleavages( )_ will return a long form data table that can easily be used for plotting. The dictionary jvis.CommonPalettesAsHex contains a number of palettes that are common to both matplotlib and ggplot (from R). These are provided to ensure consistency is easy to achieve across both languages. We'll plot the missed cleavages with the generic function jvis.BarPlotByGroup_sbplot( )
#### Extract the missed cleavage data into a long form table for plottingMissedCleavages = jinspect.MQ_getMissedCleavages(evidence, metadata, drop_contaminants = True)MissedCleavages.sort_values(by=['expt','sample'], inplace = True)MissedCleavages| sample | % Missed Cleavages | group | expt | |
|---|---|---|---|---|
| 10 | b3e3_254-A | 24 | b3e3_254 | b3e3 |
| 5 | b3e3_254-B | 24 | b3e3_254 | b3e3 |
| 14 | b3e3_254-C | 24 | b3e3_254 | b3e3 |
| 15 | b3e3_254-D | 24 | b3e3_254 | b3e3 |
| 12 | b3e3_nCL-A | 22 | b3e3_nCL | b3e3 |
| 16 | b3e3_nCL-B | 21 | b3e3_nCL | b3e3 |
| 20 | b3e3_nCL-C | 16 | b3e3_nCL | b3e3 |
| 31 | b3e3_nCL-D | 17 | b3e3_nCL | b3e3 |
| 21 | b3e4_254-A | 31 | b3e4_254 | b3e4 |
| 27 | b3e4_254-B | 32 | b3e4_254 | b3e4 |
| 23 | b3e4_254-C | 32 | b3e4_254 | b3e4 |
| 19 | b3e4_254-D | 32 | b3e4_254 | b3e4 |
| 3 | b3e4_nCL-A | 35 | b3e4_nCL | b3e4 |
| 28 | b3e4_nCL-B | 34 | b3e4_nCL | b3e4 |
| 2 | b3e4_nCL-C | 34 | b3e4_nCL | b3e4 |
| 11 | b3e4_nCL-D | 34 | b3e4_nCL | b3e4 |
| 29 | b4e3_254-A | 25 | b4e3_254 | b4e3 |
| 7 | b4e3_254-B | 25 | b4e3_254 | b4e3 |
| 13 | b4e3_254-C | 25 | b4e3_254 | b4e3 |
| 8 | b4e3_254-D | 26 | b4e3_254 | b4e3 |
| 26 | b4e3_nCL-A | 18 | b4e3_nCL | b4e3 |
| 0 | b4e3_nCL-B | 18 | b4e3_nCL | b4e3 |
| 9 | b4e3_nCL-C | 20 | b4e3_nCL | b4e3 |
| 6 | b4e3_nCL-D | 19 | b4e3_nCL | b4e3 |
| 30 | b4e4_254-B | 22 | b4e4_254 | b4e4 |
| 25 | b4e4_254-C | 33 | b4e4_254 | b4e4 |
| 17 | b4e4_254-D | 33 | b4e4_254 | b4e4 |
| 18 | b4e4_254-F | 32 | b4e4_254 | b4e4 |
| 1 | b4e4_nCL-A | 35 | b4e4_nCL | b4e4 |
| 24 | b4e4_nCL-B | 35 | b4e4_nCL | b4e4 |
| 22 | b4e4_nCL-C | 36 | b4e4_nCL | b4e4 |
| 4 | b4e4_nCL-D | 35 | b4e4_nCL | b4e4 |
x
### Select a colour palette cpal = jvis.CommonPalettesAsHexset2_paired = []for i in cpal['Set2_qual']: set2_paired.append(i) set2_paired.append(i)#### Plot the grouped data points sns.set_style('whitegrid')jvis.BarPlotByGroup_sbplot(MissedCleavages, x_col = 'group', y_col = '% Missed Cleavages', title = '% Missed Cleavages', pal = set2_paired)<Figure size 432x288 with 0 Axes>
xxxxxxxxxxResults: Average
Digestion efficiency isn't great. Let's reduce the digestion volume by 0.5x and maintain the same trypsin concentration in future experiments
xxxxxxxxxx## 6. Remove Contaminants __Functions__ jwrangle.MQ_getThreePassFilter( ) SeqIO.parse( )After QC we no longer want the contaminants in our data. __jwrangle.MQ_getThreePassFilter( )__ will remove reverse peptides, contaminants, and only identified by site from MQ tables. The filter will also accept customised exclusion lists in case users have added odd protein species to the search FASTA tables. In this particular experiment we added to the human FASTA, RNAse proteins and the large T antigen. The former as 1) a check that dynamic range is not being overwhelmed and 2) as an quantitative spike-in control to compare tryptic efficiency and the sample recovery across samples following C18 cleanup.6. Remove Contaminants
Functions
jwrangle.MQ_getThreePassFilter( )
SeqIO.parse( )
After QC we no longer want the contaminants in our data. jwrangle.MQ_getThreePassFilter( ) will remove reverse peptides, contaminants, and only identified by site from MQ tables.
The filter will also accept customised exclusion lists in case users have added odd protein species to the search FASTA tables. In this particular experiment we added to the human FASTA, RNAse proteins and the large T antigen. The former as 1) a check that dynamic range is not being overwhelmed and 2) as an quantitative spike-in control to compare tryptic efficiency and the sample recovery across samples following C18 cleanup.
#### Map the location of the custom FASTA elementsos.listdir(base_path / 'my_resources' / 'FASTA') #### Create a list of the non-human proteins that were added to the custom FASTA genome search. new_cont = []with open(base_path / 'my_resources' / 'FASTA' / "custom_proteome_elements.fasta", "r") as handle: for record in SeqIO.parse(handle, "fasta"): new_cont.append(record.id.split('|')[1])#### Remove all unwanted contaminants and IDs from the proteinGroups table pGroup_clean = jwrangle.MQ_getThreePassFilter(pGroups_map, custom_exclusion = new_cont)#### Inspect the cleaned dataframepGroup_clean[['ENTREZGENE_gPro primary'] + [i for i in pGroup_clean.columns if 'iBAQ' in i]].head(2)| ENTREZGENE_gPro primary | iBAQ | iBAQ b3e3_nCL-A | iBAQ b3e3_nCL-B | iBAQ b3e3_nCL-C | iBAQ b3e3_nCL-D | iBAQ b3e3_254-A | iBAQ b3e3_254-B | iBAQ b3e3_254-C | iBAQ b3e3_254-D | ... | iBAQ b3e4_254-C | iBAQ b3e4_254-D | iBAQ b4e4_nCL-A | iBAQ b4e4_nCL-B | iBAQ b4e4_nCL-C | iBAQ b4e4_nCL-D | iBAQ b4e4_254-B | iBAQ b4e4_254-F | iBAQ b4e4_254-C | iBAQ b4e4_254-D | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HDLBP | 1.044400e+09 | 0.0 | 0.0 | 0.0 | 0.0 | 9.678600e+07 | 7.080900e+07 | 9.100200e+07 | 9.025000e+07 | ... | 60376000.0 | 50760000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3174900.0 | 57683000.0 | 82764000.0 | 75414000.0 |
| 1 | RPS9 | 1.689600e+10 | 0.0 | 0.0 | 0.0 | 0.0 | 1.421100e+09 | 1.117600e+09 | 1.870200e+09 | 2.090800e+09 | ... | 399610000.0 | 324730000.0 | 993560.0 | 1235400.0 | 0.0 | 0.0 | 36901000.0 | 420100000.0 | 579340000.0 | 495440000.0 |
2 rows × 34 columns
xxxxxxxxxx7. Drop Gene Duplicates and Filter Intensities by LFQ
Functions
jinspect.MQ_dropDuplicateIDs( )
The next step focuses on improving confidence in the quality of our data. This is done by applying jinspect.MQ_dropDuplicateIDs( ) which has the below effects:
- Because one gene can have many proteins, sometimes Maxquant will create multiple proteinGroups for a single gene. As most of our analysis focuses on genes we'll trim the lowest quality proteinGroups duplicates from the table.
- Standard LFQ defaults require a minimum of 2 peptide species, at least one of which must be unique, for quantitation to be applied. Intensity and iBAQ values, however, do not have such a minimum limit. I consider a 2 peptide minimum to be a wise filter but still have use for the Intensity and iBAQ values. Thus where the LFQ filter is applied all measurements that do not meet the minimum limit will be discarded. In short, if there isn't a companion LFQ value, there won't be an Intensity or iBAQ value either after filtering.
- It has been documented that Match Between Runs suffers a high frequency of false peptide transfers (Lim, Paulo, Gygi 2019; doi: 10.1021/acs.jproteome.9b00492). At the protein level, however, this false transfer rate is greatly mitigated by the minimum peptide rule applied by the LFQ algorithm. This is another good reason for our filtering step.
#### Drop duplicates and apply LFQ filterfilter_dict = jinspect.MQ_dropDuplicateIDs(pGroup_clean, metadata, prefix = 'Peptides', ID = 'ENTREZGENE_gPro primary', pool = 'measure', drop_ID = 'None', keep_PoolCalcs = False, applyLFQ_filter = ['Intensity', 'iBAQ'])#### The df_keep value contains our targets, df_droprows conatins the discarded duplicates. Assign the df_keep value to a new variable and inspect.pGroup_filtered = filter_dict['df_keep']pGroup_filtered.head(2)WARNING: jinspect.MQ_getMeasuredMeansByGroup() has not been checked
| Protein IDs | Majority protein IDs | Peptide counts (all) | Peptide counts (razor+unique) | Peptide counts (unique) | Protein names | Gene names | Fasta headers | Number of proteins | Peptides | ... | iBAQ b4e3_nCL-C | iBAQ b4e3_nCL-D | iBAQ b4e4_254-B | iBAQ b4e4_254-C | iBAQ b4e4_254-D | iBAQ b4e4_254-F | iBAQ b4e4_nCL-A | iBAQ b4e4_nCL-B | iBAQ b4e4_nCL-C | iBAQ b4e4_nCL-D | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A0A024R4E5;Q00341;Q00341-2;H0Y394;H7C0A4;C9JIZ... | A0A024R4E5;Q00341;Q00341-2;H0Y394 | 56;55;50;39;18;15;13;11;10;10;10;10;9;6;6;5;5;... | 56;55;50;39;18;15;13;11;10;10;10;10;9;6;6;5;5;... | 56;55;50;39;18;15;13;11;10;10;10;10;9;6;6;5;5;... | Vigilin | HDLBP | ;;; | 23 | 56 | ... | 0.0 | 0.0 | 3174900.0 | 82764000.0 | 75414000.0 | 57683000.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | A0A024R4M0;P46781;B5MCT8;C9JM19;F2Z3C0;A8MXK4 | A0A024R4M0;P46781;B5MCT8;C9JM19 | 16;16;13;13;4;4 | 16;16;13;13;4;4 | 16;16;13;13;4;4 | 40S ribosomal protein S9 | RPS9 | ;;; | 6 | 16 | ... | 0.0 | 0.0 | 36901000.0 | 579340000.0 | 495440000.0 | 420100000.0 | 0.0 | 1235400.0 | 0.0 | 0.0 |
2 rows × 336 columns
xxxxxxxxxx8. Review Sample Clustering by Group
Functions
jtest.getDistanceMatrix( )
jvis.MQ_showDendrogramQC_mplplot( )
A distance matrix function jtest.getDistanceMatrix( ) is provided for users who wish to apply different algorithms or create different visualisations.
I like the 'ward' method for distance calculations and using a dengrogram to confirm that clustering matches expectations and so use a prerolled function jvis.MQ_showDendrogramQC_mplplot( )
#### Confirm that clustering matches expectationsjvis.MQ_showDendrogramQC_mplplot(pGroup_filtered, 'Intensity', metadata, 'QC clustering: ', grid = 'YES', fsize = (8, 8))xxxxxxxxxx9. Analyse Normalisation Effects by Sample
Functions
jwrangle.Log2_ByPrefix( )
jwrangle.MQ_poolMulti( )
jvis.ViolinCompare_sbplot( )
Here we review normalisation effects on each sample within the condition groups; these are most easily interpreted after log2 transformation. We will transform all measures of interest with jwrangle.Log2_ByPrefix( ) and then pool all the values of interest, by condition, with jwrangle.MQ_poolMulti( ). The function jvis.ViolinCompare_sbplot( ) will let use compare Intensity distribution on a per sample basis.
Normalisation is applied to LFQ values by MaxQuant and is a feature of its handling of label-free data. I've not seen a detailed explanation of how it works though so it is a leap of faith that Cox and Mann have selected an appropriate method.
Normalisation must be applied separately to nCL and cCL groups. This is unusual though necessary to avoid outrageous results caused by having groups with extreme differences. See expt.313 for evidence.
#### Log2 transform available intensity values.pGroup_log2 = jwrangle.Log2_ByPrefix(pGroup_filtered, 'LFQ intensity')pGroup_log2 = jwrangle.Log2_ByPrefix(pGroup_log2, 'iBAQ')pGroup_log2 = jwrangle.Log2_ByPrefix(pGroup_log2, 'Intensity')pGroup_log2.replace(0,np.nan, inplace=True)#### Create a long form dataset for each desired groupingpool_SampleIntensity = jwrangle.MQ_poolMulti(pGroup_log2, metadata, melt_list = ['Intensity', 'LFQ intensity'], group = 'condition')pool_SampleIntensity.keys()dict_keys(['b3e3_nCL', 'b3e3_254', 'b4e3_nCL', 'b4e3_254', 'b3e4_nCL', 'b3e4_254', 'b4e4_nCL', 'b4e4_254'])
xxxxxxxxxx#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are #### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violationssns.set_style('whitegrid')jvis.ViolinCompare_sbplot(pool_SampleIntensity['b3e3_nCL'], title = 'b3e3_nCL: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = ['#ff6666', '#99ccff'])xxxxxxxxxx#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are #### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violationssns.set_style('whitegrid')jvis.ViolinCompare_sbplot(pool_SampleIntensity['b3e3_254'], title = 'b3e3_254: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = ['#ff6666', '#99ccff'])xxxxxxxxxx#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are #### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violationssns.set_style('whitegrid')jvis.ViolinCompare_sbplot(pool_SampleIntensity['b3e4_nCL'], title = 'b3e4_nCL: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = cpal['Set3_qual'])xxxxxxxxxx#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are #### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violationssns.set_style('whitegrid')jvis.ViolinCompare_sbplot(pool_SampleIntensity['b3e4_254'], title = 'b3e4_254: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = cpal['Set3_qual'])xxxxxxxxxx#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are #### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violationssns.set_style('whitegrid')jvis.ViolinCompare_sbplot(pool_SampleIntensity['b4e3_nCL'], title = 'b4e3_nCL: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = cpal['Set3_qual'][4:])xxxxxxxxxx#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are #### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violationssns.set_style('whitegrid')jvis.ViolinCompare_sbplot(pool_SampleIntensity['b4e3_254'], title = 'b4e3_254: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = cpal['Set3_qual'][4:])xxxxxxxxxx#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are #### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violationssns.set_style('whitegrid')jvis.ViolinCompare_sbplot(pool_SampleIntensity['b4e4_nCL'], title = 'b4e4_nCL: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = cpal['Set1_qual'][2:])xxxxxxxxxx#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are #### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violationssns.set_style('whitegrid')jvis.ViolinCompare_sbplot(pool_SampleIntensity['b4e4_nCL'], title = 'b4e4_nCL: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = cpal['Set1_qual'][2:])xxxxxxxxxx10. Compare Intensity Distribution and Sequence Coverage
Functions
jwrangle.MQ_poolDataByCondition( )
jvis.BoxPlotByColumn_sbplot( )
Next we will compare intensity and sequence coverage between groups. Log2 transformation has already been performed so we need only use jwrangle.MQ_poolDataByCondition( ) to create the appropriate long form dataset for plotting with jvis.BoxPlotByColumn_sbplot( ).
xxxxxxxxxx#### Pool data into a single long form datasetpooled_dfDropGroupOne = jwrangle.MQ_poolDataByCondition(pGroup_log2, metadata_sort, prefix_list = ['Intensity', 'Sequence coverage'])#### Compare Intensity distribution using a box and whisker plotsns.set_style('whitegrid')jvis.BoxPlotByColumn_sbplot(pooled_dfDropGroupOne, 'Intensity: ', 'Intensity')xxxxxxxxxx#### Compare Sequence coverage using a box and whisker plotsns.set_style('whitegrid')jvis.BoxPlotByColumn_sbplot(pooled_dfDropGroupOne, 'Sequence coverage: ', 'Sequence coverage %')xxxxxxxxxx11. Compare Sum Peptide Counts
Functions
jinspect.MQ_getSumBySample( )
jvis.BarPlotByGroup_sbplot( )
To sum the total peptides observed across all proteins use jinspect.MQ_getSumBySample( ). These sums will be returned as a modified metadata table.
Plotting these by group is easily done with jvis.BarPlotByGroup_sbplot( ). The plotting order is determined by the metadata ordering.
In this case we are inspecting the number of peptides detected after having removed contaminants- thus if some spike-in proteins were removed, i.e. in this case RNAse treatments, they will not contribute to the peptide count. To look at the replicability of these spike-ins, we would reach back to the 'df_droprows' table generated by jinspect.MQ_dropDuplicateIDs( ) in section 7.
#### Extract the total peptides observed per samplemetaStats = jinspect.MQ_getSumBySample(pGroup_log2, metadata_sort, freqList = ['Peptides'], measure = False)#### Plot the sum peptidessns.set_style('whitegrid')jvis.BarPlotByGroup_sbplot(metaStats, x_col = 'condition', y_col = 'Peptides', title = 'Sum Peptides vs pH enrichment', pal = set2_paired, errorbars = 'SEM')<Figure size 432x288 with 0 Axes>
xxxxxxxxxx12. Compare Unique Gene Counts
Functions
jinspect.MQ_getFrequencyBySample( )
One gene can encode for many proteins that often share regions of similarity. As for illumina-based RNA-Seq, however, shotgun proteomics can rarely assign a peptide species to a singular protein. In MaxQuant these are called proteinGroups. Because we have do not require protein-specific results, and gene identity is more stable, our gene count describes the groups to which our detected proteins have been be assigned. Thus gene here is being detected by protein product, just as it would be detected by RNA product in RNA Seq; none of these 3 are synonymous. To be clear, this is a count and not a measure.
Gene frequency is defined by the summed observations per protein regardless of intensity value and this data is extracted to our modified metadata with jinspect.MQ_getFrequencyBySample( ) .
A typical MQ search will yield identical protein counts (though different values) for Intensity and iBAQ*. LFQ frequencies will vary depending on the search settings:
- In this case the MQ search has set LFQ values to be calculated on a min 2 peptide ratio (this is the default)**
Notes
* Why protein counts should be identical I don't know. The original iBAQ paper stipulates rules for the inclusion of a protein in the iBAQ calculation but MaxQuant doesn't seem to apply them.
** Previously I tested LFQ min ratio at 1 peptide. At 1 minimum peptide there was unexpected QC clustering. Possible explanations for this are explained in section 7 and are cleaned up by jinspect.MQ_dropDuplicateIDs( ) function. We can expect this function to greatly reduce qualifying IDs (~20% fewer), especially in the QE samples, but I think the trade-off is worth it because we gain 1) a more robust ID check and 2) the same search can be used for LFQ based checks of dynamic changes, i.e. comparing more than one group of cCL captures for biological changes.
xxxxxxxxxx#### Count the number of unique metaStats = jinspect.MQ_getFrequencyBySample(pGroup_log2, metaStats, freqList = ['Intensity', 'iBAQ', 'LFQ intensity'], measure = False)metaStats.iloc[:,1:]| condition | replicate | sample | measure | MQgroups | Peptides | Intensity | iBAQ | LFQ intensity | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | b3e3_nCL | A | b3e3_nCL-A | Intensity | b3e3 | 180.0 | 31 | 31 | 31 |
| 1 | b3e3_nCL | B | b3e3_nCL-B | Intensity | b3e3 | 156.0 | 32 | 32 | 32 |
| 2 | b3e3_nCL | C | b3e3_nCL-C | Intensity | b3e3 | 155.0 | 50 | 50 | 50 |
| 3 | b3e3_nCL | D | b3e3_nCL-D | Intensity | b3e3 | 125.0 | 51 | 51 | 51 |
| 4 | b3e3_254 | A | b3e3_254-A | Intensity | b3e3 | 13799.0 | 1255 | 1255 | 1255 |
| 5 | b3e3_254 | B | b3e3_254-B | Intensity | b3e3 | 13735.0 | 1253 | 1253 | 1253 |
| 6 | b3e3_254 | C | b3e3_254-C | Intensity | b3e3 | 13854.0 | 1285 | 1285 | 1285 |
| 7 | b3e3_254 | D | b3e3_254-D | Intensity | b3e3 | 13928.0 | 1528 | 1528 | 1528 |
| 8 | b3e4_nCL | A | b3e4_nCL-A | Intensity | b3e4 | 9859.0 | 697 | 697 | 697 |
| 9 | b3e4_nCL | B | b3e4_nCL-B | Intensity | b3e4 | 11362.0 | 751 | 751 | 751 |
| 10 | b3e4_nCL | C | b3e4_nCL-C | Intensity | b3e4 | 11206.0 | 758 | 758 | 758 |
| 11 | b3e4_nCL | D | b3e4_nCL-D | Intensity | b3e4 | 11832.0 | 961 | 961 | 961 |
| 12 | b3e4_254 | A | b3e4_254-A | Intensity | b3e4 | 15220.0 | 1177 | 1177 | 1177 |
| 13 | b3e4_254 | B | b3e4_254-B | Intensity | b3e4 | 15259.0 | 1193 | 1193 | 1193 |
| 14 | b3e4_254 | C | b3e4_254-C | Intensity | b3e4 | 15127.0 | 1209 | 1209 | 1209 |
| 15 | b3e4_254 | D | b3e4_254-D | Intensity | b3e4 | 15312.0 | 1471 | 1471 | 1471 |
| 16 | b4e3_nCL | A | b4e3_nCL-A | Intensity | b4e3 | 261.0 | 54 | 54 | 54 |
| 17 | b4e3_nCL | B | b4e3_nCL-B | Intensity | b4e3 | 263.0 | 58 | 58 | 58 |
| 18 | b4e3_nCL | C | b4e3_nCL-C | Intensity | b4e3 | 275.0 | 64 | 64 | 64 |
| 19 | b4e3_nCL | D | b4e3_nCL-D | Intensity | b4e3 | 263.0 | 88 | 88 | 88 |
| 20 | b4e3_254 | A | b4e3_254-A | Intensity | b4e3 | 14110.0 | 1303 | 1303 | 1303 |
| 21 | b4e3_254 | B | b4e3_254-B | Intensity | b4e3 | 13672.0 | 1259 | 1259 | 1259 |
| 22 | b4e3_254 | C | b4e3_254-C | Intensity | b4e3 | 13632.0 | 1286 | 1286 | 1286 |
| 23 | b4e3_254 | D | b4e3_254-D | Intensity | b4e3 | 13798.0 | 1533 | 1533 | 1533 |
| 24 | b4e4_nCL | A | b4e4_nCL-A | Intensity | b4e4 | 13312.0 | 834 | 834 | 834 |
| 25 | b4e4_nCL | B | b4e4_nCL-B | Intensity | b4e4 | 13628.0 | 858 | 858 | 858 |
| 26 | b4e4_nCL | C | b4e4_nCL-C | Intensity | b4e4 | 13209.0 | 847 | 847 | 847 |
| 27 | b4e4_nCL | D | b4e4_nCL-D | Intensity | b4e4 | 13925.0 | 1072 | 1072 | 1072 |
| 28 | b4e4_254 | B | b4e4_254-B | Intensity | b4e4 | 11277.0 | 995 | 995 | 995 |
| 29 | b4e4_254 | C | b4e4_254-C | Intensity | b4e4 | 16097.0 | 1220 | 1220 | 1220 |
| 30 | b4e4_254 | D | b4e4_254-D | Intensity | b4e4 | 16280.0 | 1522 | 1522 | 1522 |
| 31 | b4e4_254 | F | b4e4_254-F | Intensity | b4e4 | 16250.0 | 1219 | 1219 | 1219 |
#### Plot the countssns.set_style('whitegrid')jvis.BarPlotByGroup_sbplot(metaStats, x_col = 'condition', y_col = 'Intensity', title = '# Genes Detected By Group', pal = set2_paired, ylabel = 'Unique Genes', errorbars = 'SEM')<Figure size 432x288 with 0 Axes>
xxxxxxxxxx13. Assess Replicate Correlation
Functions
jwrangle.MQ_getSliceByPrefix( )
jvis.showPearsonRegression_altair( )
The function jwrangle.MQ_getSliceByPrefix( ) provides a convenient means of extracting values of a specific group.
We can then use jvis.showPearsonRegression_altair( ) to perform pairwise comparisons between each member of those groups. This function is specifically applied to genes with shared intensities- genes exclusive to one sample or the other, represented by vertical or horizontal datapoints, are plotted but excluded from the pearson calculation.
#### Extract the intensity values as a dictionary where keys = groupsIntensity_Dict = jwrangle.MQ_getSliceByPrefix(pGroup_log2, metadata, 'Intensity', group = 'condition', add_col = None)Intensity_Dict.keys()dict_keys(['b3e3_nCL', 'b3e3_254', 'b4e3_nCL', 'b4e3_254', 'b3e4_nCL', 'b3e4_254', 'b4e4_nCL', 'b4e4_254'])
x
#### Check replicate consistency across all within group pairsjvis.showPearsonRegression_altair(Intensity_Dict['b3e3_nCL'], mark_color = set2_paired[2])jvis.showPearsonRegression_altair(Intensity_Dict['b3e3_254'], mark_color = set2_paired[2])jvis.showPearsonRegression_altair(Intensity_Dict['b4e3_nCL'], mark_color = set2_paired[4])jvis.showPearsonRegression_altair(Intensity_Dict['b4e3_254'], mark_color = set2_paired[4])jvis.showPearsonRegression_altair(Intensity_Dict['b3e4_nCL'], mark_color = set2_paired[6])jvis.showPearsonRegression_altair(Intensity_Dict['b3e4_254'], mark_color = set2_paired[6])jvis.showPearsonRegression_altair(Intensity_Dict['b4e4_nCL'], mark_color = set2_paired[8])jvis.showPearsonRegression_altair(Intensity_Dict['b4e4_254'], mark_color = set2_paired[8])xxxxxxxxxx14. Classify RBP
Functions
jinspect.MQ_getFrequencyByGroup()
jtest.MQ_applyClassifyRBP()
Before classifying our RBP we need to first tally the frequency with which each protein appears in each condition using jinspect.MQ_getFrequencyByGroup( )
Once done, we use jtest.MQ_applyClassifyRBP( ) to generate a dictionary from which each class can be reviewed or plotted.
#### Use the metadata and proteinGroups tables to count how many times a gene is identified in its group (/6). #### Here I demonstrate how we can count for all instances of Intensity, iBAQ and LFQ IntensitypGroup_Freq = jinspect.MQ_getFrequencyByGroup(pGroup_log2, metadata, 'iBAQ', group = 'condition')pGroup_Freq = jinspect.MQ_getFrequencyByGroup(pGroup_Freq, metadata, 'LFQ intensity', group = 'condition')pGroup_Freq = jinspect.MQ_getFrequencyByGroup(pGroup_Freq, metadata, 'Intensity', group = 'condition')pGroup_Freq[['ENTREZGENE_gPro primary'] + [i for i in pGroup_Freq.columns if 'Freq' in i]].head(2)| ENTREZGENE_gPro primary | iBAQ Freq: b3e3_nCL | iBAQ Freq: b3e3_254 | iBAQ Freq: b4e3_nCL | iBAQ Freq: b4e3_254 | iBAQ Freq: b3e4_nCL | iBAQ Freq: b3e4_254 | iBAQ Freq: b4e4_nCL | iBAQ Freq: b4e4_254 | LFQ intensity Freq: b3e3_nCL | ... | LFQ intensity Freq: b4e4_nCL | LFQ intensity Freq: b4e4_254 | Intensity Freq: b3e3_nCL | Intensity Freq: b3e3_254 | Intensity Freq: b4e3_nCL | Intensity Freq: b4e3_254 | Intensity Freq: b3e4_nCL | Intensity Freq: b3e4_254 | Intensity Freq: b4e4_nCL | Intensity Freq: b4e4_254 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HDLBP | 0 | 4 | 0 | 4 | 0 | 4 | 0 | 4 | 0 | ... | 0 | 4 | 0 | 4 | 0 | 4 | 0 | 4 | 0 | 4 |
| 1 | RPS9 | 0 | 4 | 0 | 4 | 4 | 4 | 1 | 4 | 0 | ... | 1 | 4 | 0 | 4 | 0 | 4 | 4 | 4 | 1 | 4 |
2 rows × 25 columns
x
#### Now we can classify RBP; the console will report if all proteins/genes have been properly classified or not.RBP_Dict_b3e3 = jtest.MQ_applyClassifyRBP(pGroup_Freq, 'LFQ intensity', metadata, 'b3e3', 'b3e3_nCL', 'b3e3_254', 3, add_cols = ['ENTREZGENE_gPro all', 'ENTREZGENE_gPro primary', 'ENTREZGENE_gPro name'])All proteins classified = True
x
#### Now we can classify RBP; the console will report if all proteins/genes have been properly classified or not.RBP_Dict_b3e4 = jtest.MQ_applyClassifyRBP(pGroup_Freq, 'LFQ intensity', metadata, 'b3e4', 'b3e4_nCL', 'b3e4_254', 3, add_cols = ['ENTREZGENE_gPro all', 'ENTREZGENE_gPro primary', 'ENTREZGENE_gPro name'])All proteins classified = True
x
#### Now we can classify RBP; the console will report if all proteins/genes have been properly classified or not.RBP_Dict_b4e3 = jtest.MQ_applyClassifyRBP(pGroup_Freq, 'LFQ intensity', metadata, 'b4e3', 'b4e3_nCL', 'b4e3_254', 3, add_cols = ['ENTREZGENE_gPro all', 'ENTREZGENE_gPro primary', 'ENTREZGENE_gPro name'])All proteins classified = True
x
#### Now we can classify RBP; the console will report if all proteins/genes have been properly classified or not.RBP_Dict_b4e4 = jtest.MQ_applyClassifyRBP(pGroup_Freq, 'LFQ intensity', metadata, 'b4e4', 'b4e4_nCL', 'b4e4_254', 3, add_cols = ['ENTREZGENE_gPro all', 'ENTREZGENE_gPro primary', 'ENTREZGENE_gPro name'])All proteins classified = True
#### The results of this classifcation can be found in a dictionary of dataframes for each output#### See help() for an explanation of each datasetRBP_Dict_b3e3.keys()dict_keys(['Input_df_annStatus', 'Summary_df_annStatus', 'I', 'IIa', 'IIb', 'IIc', 'NC', 'ND'])
x
#### We want the most general overview which of our classes per treatment and so concatenate the results from each Summary_df_annStatusRBP_Class = pd.concat([RBP_Dict_b3e3['Summary_df_annStatus'], RBP_Dict_b3e4['Summary_df_annStatus'], RBP_Dict_b4e3['Summary_df_annStatus'], RBP_Dict_b4e4['Summary_df_annStatus']], axis=0, join='outer')#### And represent them in a barplotfrom matplotlib import pyplot as pltplt.figure('rbp class')sns.set_style('whitegrid')sns.set(font_scale=1.1)ax = sns.countplot(x='MQgroup', hue = 'RBP Class', data=RBP_Class.sort_values(by = ['MQgroup', 'RBP Class']), palette = cpal['RBP_Class'], edgecolor = 'black')ax.set_ylabel('Unique Gene Count')ax.set_title('Unique Genes detected per RBP Class')RBP_Class.head(1)| ENTREZGENE_gPro all | ENTREZGENE_gPro primary | ENTREZGENE_gPro name | Gene names | RBP Class | RBP subClass | MQgroup | |
|---|---|---|---|---|---|---|---|
| 0 | HDLBP;nan | HDLBP | high density lipoprotein binding protein [Sour... | HDLBP | I | b3e3 |
x
#### Because each subclass of the class II RBP are identified based on different statistical assumptions we can more closely inspect if those assumptions trend differently among the different treatments.plt.figure('rbp class')sns.set_style('whitegrid')sns.set(font_scale=1.1)ax = sns.countplot(x='MQgroup', hue = 'RBP subClass', data=RBP_Class[RBP_Class['RBP subClass']!=''].sort_values(by=['MQgroup','RBP subClass']), palette = cpal['Set3_qual'], edgecolor = 'black')ax.set_ylabel('Unique Gene Count')ax.set_title('Unique Genes detected per RBP Class')RBP_Class.head(1)| ENTREZGENE_gPro all | ENTREZGENE_gPro primary | ENTREZGENE_gPro name | Gene names | RBP Class | RBP subClass | MQgroup | |
|---|---|---|---|---|---|---|---|
| 0 | HDLBP;nan | HDLBP | high density lipoprotein binding protein [Sour... | HDLBP | I | b3e3 |
x
## 15. Compare RBP Identity Between Ethanol Treatments Here we use the dictionaries output by __jtest.MQ_applyClassifyRBP( )__ for the purposes of creating a Venn Diagram. 15. Compare RBP Identity Between Ethanol Treatments
Here we use the dictionaries output by jtest.MQ_applyClassifyRBP( ) for the purposes of creating a Venn Diagram.
xxxxxxxxxxfrom matplotlib import pyplot as pltimport numpy as npfrom matplotlib_venn import venn3, venn3_circlesplt.figure(figsize=(6,6))v = venn3([set(RBP_Dict_b3e4['NC']['ENTREZGENE_gPro primary']), set(RBP_Dict_b3e3['I']['ENTREZGENE_gPro primary']), set(RBP_Dict_b4e3['I']['ENTREZGENE_gPro primary'])], set_labels = ('b3e4', 'b3e3', 'b4e3'), alpha = 0.5)c = venn3_circles([set(RBP_Dict_b3e4['NC']['ENTREZGENE_gPro primary']), set(RBP_Dict_b3e3['I']['ENTREZGENE_gPro primary']), set(RBP_Dict_b4e3['I']['ENTREZGENE_gPro primary'])], linestyle='dashed')c[0].set_lw(1.5)c[0].set_ls('dotted')c[1].set_lw(1.5)c[1].set_ls('dotted')c[2].set_lw(1.5)c[2].set_ls('dotted')for text in v.set_labels: text.set_fontsize(18)for text in v.subset_labels: text.set_fontsize(14)x
## 16. Review Basic Gene Ontology __Functions__ jwrangle.importMixedFiles( ) jweb.fetchQuickGO_stats( ) In this section we will explore Gene Ontology (GO) memberships for the observed proteins. There is little use in applying statistical tests such as Gene Ontology Enrichment Analysis (GOEA) for these experiments; the combination of selection by RNA interaction and the comparative lack of deep RBP validation for many candidates would make such a study rather spurious. We can, however, investigate the frequency with which our identified RBPs appear in previous studies. In addition, we can use this frequency to further assess whether conditions b3 and b4 are equivalent or not. A number of GO-specific and utility functions are provided to help with retrieving Gene Ontologies from the QuickGo database. In this section we'll look at the most basic. The function __jweb.fetchQuickGO_stats( )__ will fetch the annotation statistics for all records belonging to the gene ID from a submitted list. Reviewing these statistics before beginning an analysis is ideal, because it contextualises the breadth of future analyses. This function returns these statistics in the from of a dictionary which can be converted to a dataframe for easier viewing by using the dedicated function __jweb.getQuickGO_stats( )__. In addition to contextualising the search space of subsequent analyses, fetching the annotation numbers is important for checking that the number of records, per GO ID, falls below 10000. This is because QuickGo will not allow larger searches to be done programmatically. If your GO ID of interest has many more records users should retrieve their records manually. These details will be covered fourther in the next section.16. Review Basic Gene Ontology
Functions
jwrangle.importMixedFiles( )
jweb.fetchQuickGO_stats( )
In this section we will explore Gene Ontology (GO) memberships for the observed proteins. There is little use in applying statistical tests such as Gene Ontology Enrichment Analysis (GOEA) for these experiments; the combination of selection by RNA interaction and the comparative lack of deep RBP validation for many candidates would make such a study rather spurious. We can, however, investigate the frequency with which our identified RBPs appear in previous studies. In addition, we can use this frequency to further assess whether conditions b3 and b4 are equivalent or not.
A number of GO-specific and utility functions are provided to help with retrieving Gene Ontologies from the QuickGo database. In this section we'll look at the most basic.
The function jweb.fetchQuickGO_stats( ) will fetch the annotation statistics for all records belonging to the gene ID from a submitted list. Reviewing these statistics before beginning an analysis is ideal, because it contextualises the breadth of future analyses. This function returns these statistics in the from of a dictionary which can be converted to a dataframe for easier viewing by using the dedicated function jweb.getQuickGO_stats( ).
In addition to contextualising the search space of subsequent analyses, fetching the annotation numbers is important for checking that the number of records, per GO ID, falls below 10000. This is because QuickGo will not allow larger searches to be done programmatically. If your GO ID of interest has many more records users should retrieve their records manually. These details will be covered fourther in the next section.
xxxxxxxxxx#### I like to keep a table of interesting GO terms in a local csv file, let's find itMyResources = jwrangle.importMixedFiles(base_path / 'my_resources')MyResources.keys()dict_keys(['.ipynb_checkpoints', 'control_elements', 'control_proteins', 'custom_dfs', 'FASTA', 'GOexplore.ipynb', 'GO_TermsOfInterest.csv', 'GO_TermsOfInterest_stats.csv', 'Hentze_core_RBP_list.txt', 'Hentze_core_RBP_list_uniprot.txt', 'Hentze_core_RBP_plusE1_8.txt', 'instrumentQC', 'limma_normalizeTIs_p3550_p3592_p3657.nb.html', 'limma_normalizeTIs_p3550_p3592_p3657.Rmd', 'TIs_p3550_p3592_p3657.csv', 'TIs_p3550_p3592_p3657_loess.csv'])
MyResources['GO_TermsOfInterest.csv'].head(9)| GO Term | relationship | depth | go ID | Description | ctrl list type | parent(s) | |
|---|---|---|---|---|---|---|---|
| 0 | binding | Molecular Function | 1 | GO:0005488 | NaN | NaN | GO:0003674 |
| 1 | protein binding | Molecular Function | 2 | GO:0005515 | NaN | NaN | GO:0005488 |
| 2 | chromatin binding | Molecular Function | 2 | GO:0003682 | NaN | NaN | GO:0005488 |
| 3 | nucleic acid binding | Molecular Function | 3 | GO:0003676 | NaN | NaN | GO:0003676 |
| 4 | DNA binding | Molecular Function | 4 | GO:0003677 | NaN | general | GO:0003676 |
| 5 | RNA binding | Molecular Function | 4 | GO:0003723 | NaN | general | GO:0003676 |
| 6 | DNA/RNA hybrid binding | Molecular Function | 4 | GO:0071667 | NaN | general | GO:0003676 |
| 7 | translation regulator activity | Molecular Function | 4 | GO:0090079 | NaN | NaN | GO:0003676 |
| 8 | regulatory region nucleic acid binding | Molecular Function | 4 | GO:0001067 | NaN | NaN | GO:0003676 |
xxxxxxxxxx#### Let's get basic statistics on the broader RNA-binding category "GO:0003723" MyGO_Stats_Dict = jweb.fetchQuickGO_stats(['GO:0003723'], QG_geneProductType = 'protein', QG_taxonId = '9606', QG_geneProductSubset = ['Swiss-Prot', 'TrEMBL'])MyGO_Stats_Dict.keys()dict_keys(['GO0003723_Swiss-Prot', 'GO0003723_TrEMBL'])
#### To view basic information, like the number of annotations associated with the GO termRNA_Binding_stats = jweb.getQuickGO_stats(MyGO_Stats_Dict)RNA_Binding_stats| go ID | Swiss-Prot_annotations | Swiss-Prot_genesProducts | TrEMBL_annotations | TrEMBL_genesProducts | |
|---|---|---|---|---|---|
| 0 | GO:0003723 | 5321 | 1689 | 5439 | 2807 |
xxxxxxxxxxResult: GOOD
The SwissProt and TrEMBL counts are below 10000 records so, conveniently, we can continue the tutorial with the simplest scenario for record extraction.
xxxxxxxxxx17. Retrieve GO Records
Functions
jweb.fetchQuickGO( )
jwrangle.concatGO_DataFrameDict( )
jweb.mapQuickGO( )
jwrangle.AnnotateDataFrameCtrls( )
jinspect.MQ_getFrequencyBySample( )
We can see above that none of our downloaded IDs have exceeded our fetch limit of 10000 records. If any did, we would need to manually retrieve the tsv file.
For the GO terms we wish to investigate to investigate jweb.fetchQuickGO( ) will fetch the relevant proteins and also apply the same onsistent gene ID mapping algorithm we used earlier when remapping the MaxQuant IDs. This process is important because it allows us to provide consistent gene IDs to subsequent sets analysis. Where the writetopath options is used, the results from any searches will be saved to a local downloads folder- this is useful as a snapshot of the annotations used at the time or as a simple way of avoiding fetching the records via API in the future.
The function jwrangle.concatGO_DataFrameDict( ) let's us concatenate a dictionary created by jweb.fetchQuickGO( ). This is useful when the user wishes to pool both SwissProt and TrEMBL records for a given
The function jwrangle.AnnotateDataFrameCtrls( ) taks our QuickGo records and annotates each protein or gene in a given dataframe (i.e. a proteinGroups table) for membership to the searched GO ID. The function jinspect.MQ_getFrequencyBySample( ) acts similarly to the peptide and gene count functions used earlier- it will sum the frequency of memberships to each GO catgeory and report the counts as a modified metadata table.
A Note on Manual Downloads
The jweb.fetchQuickGO( ) will also output to console the html address that one should use if a manual download if required; this address will encompass all the relevant record characteristics that typically used by this suite. Once this table has been downloaded, the function jweb.mapQuickGO( ) can be used to provide the same ID mapping service as performed in jweb.fetchQuickGO( ). The returned elements will also be manipulated into the same format.
#### Download and create a local copy of all the records associated with our GO query. QuickGo_dict = jweb.fetchQuickGO(['GO:0003723'], QG_geneProductType = 'protein', QG_taxonId = '9606', QG_geneProductSubset = ['Swiss-Prot', 'TrEMBL'], gConvertOrganism='hsapiens', writetopath= True)QuickGo_dict.keys()GO0003723_Swiss-Prot has 5321 records: https://www.ebi.ac.uk/QuickGO/annotations?goId=GO:0003723&taxonId=9606&taxonUsage=exact&goUsage=descendants&geneProductType=protein&geneProductSubset=Swiss-Prot GO0003723_TrEMBL has 5439 records: https://www.ebi.ac.uk/QuickGO/annotations?goId=GO:0003723&taxonId=9606&taxonUsage=exact&goUsage=descendants&geneProductType=protein&geneProductSubset=TrEMBL
dict_keys(['GO0003723_Swiss-Prot', 'GO0003723_TrEMBL'])
xxxxxxxxxx### We'll assess our RBP frequency by considering both SwissProt and TrEmbl records, thus let's concatenate the dictionaries for GO:0003723QuickGo_dict_concat = jwrangle.concatGO_DataFrameDict(QuickGo_dict)QuickGo_dict_concat.keys()dict_keys(['GO0003723'])
#### With our GO records in hand we next investigate whether our identified proteins are members of GO:OOO3723#### This can be done with any DataFrame, but here we'll use the last version of our modified proteinGroups table pGroup_GO = jwrangle.AnnotateDataFrameCtrls(pGroup_Freq, QuickGo_dict_concat, search_match = 'ENTREZGENE_gPro primary', dict_match = 'ENTREZGENE_gPro primary', none_col = 'GO_None')pGroup_GO.keys()D:\MEGA\Programming\Scripts_JS\RBP_SUITE\modules\jwrangle.py:40: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy ctrl_nohits_df[none_col] = True #add zero annotation column to annotation columns
dict_keys(['ann_df', 'ann_subdf'])
xxxxxxxxxx#### The ann_df contains the master dataframe. The sub_df contains a slice for each positively identified Gene in each GO category searched.#### Let's view what those membership annotations look likepGroup_GO['ann_df'][['ENTREZGENE_gPro primary']+[i for i in list(pGroup_GO['ann_df'].columns) if 'GO' in i]].head(2)| ENTREZGENE_gPro primary | GO0003723 | GO_None | |
|---|---|---|---|
| 0 | HDLBP | True | False |
| 1 | RPS9 | True | False |
xxxxxxxxxx#### Now to find out the counts for each GO ID being searched we will modify our metadata table in a fashion in the same way as for prptide and gene counting.#### We'll use iBAQ for these counts but, given all intensities have previously been filtered by LFQ membership both 'Intensity' or 'LFQ intensity' would give the same result.metaStatsGo = jinspect.MQ_getFrequencyBySample(pGroup_GO['ann_df'], metaStats, freqList = list(QuickGo_dict_concat.keys()) + ['GO_None'], measure = 'iBAQ')metaStatsGo.iloc[:,1:]| condition | replicate | sample | measure | MQgroups | Peptides | Intensity | iBAQ | LFQ intensity | GO0003723 | GO_None | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b3e3_nCL | A | b3e3_nCL-A | Intensity | b3e3 | 180.0 | 31 | 31 | 31 | 23 | 8 |
| 1 | b3e3_nCL | B | b3e3_nCL-B | Intensity | b3e3 | 156.0 | 32 | 32 | 32 | 24 | 8 |
| 2 | b3e3_nCL | C | b3e3_nCL-C | Intensity | b3e3 | 155.0 | 50 | 50 | 50 | 36 | 14 |
| 3 | b3e3_nCL | D | b3e3_nCL-D | Intensity | b3e3 | 125.0 | 51 | 51 | 51 | 32 | 19 |
| 4 | b3e3_254 | A | b3e3_254-A | Intensity | b3e3 | 13799.0 | 1255 | 1255 | 1255 | 884 | 371 |
| 5 | b3e3_254 | B | b3e3_254-B | Intensity | b3e3 | 13735.0 | 1253 | 1253 | 1253 | 890 | 363 |
| 6 | b3e3_254 | C | b3e3_254-C | Intensity | b3e3 | 13854.0 | 1285 | 1285 | 1285 | 897 | 388 |
| 7 | b3e3_254 | D | b3e3_254-D | Intensity | b3e3 | 13928.0 | 1528 | 1528 | 1528 | 970 | 558 |
| 8 | b3e4_nCL | A | b3e4_nCL-A | Intensity | b3e4 | 9859.0 | 697 | 697 | 697 | 312 | 385 |
| 9 | b3e4_nCL | B | b3e4_nCL-B | Intensity | b3e4 | 11362.0 | 751 | 751 | 751 | 318 | 433 |
| 10 | b3e4_nCL | C | b3e4_nCL-C | Intensity | b3e4 | 11206.0 | 758 | 758 | 758 | 337 | 421 |
| 11 | b3e4_nCL | D | b3e4_nCL-D | Intensity | b3e4 | 11832.0 | 961 | 961 | 961 | 381 | 580 |
| 12 | b3e4_254 | A | b3e4_254-A | Intensity | b3e4 | 15220.0 | 1177 | 1177 | 1177 | 760 | 417 |
| 13 | b3e4_254 | B | b3e4_254-B | Intensity | b3e4 | 15259.0 | 1193 | 1193 | 1193 | 770 | 423 |
| 14 | b3e4_254 | C | b3e4_254-C | Intensity | b3e4 | 15127.0 | 1209 | 1209 | 1209 | 777 | 432 |
| 15 | b3e4_254 | D | b3e4_254-D | Intensity | b3e4 | 15312.0 | 1471 | 1471 | 1471 | 856 | 615 |
| 16 | b4e3_nCL | A | b4e3_nCL-A | Intensity | b4e3 | 261.0 | 54 | 54 | 54 | 39 | 15 |
| 17 | b4e3_nCL | B | b4e3_nCL-B | Intensity | b4e3 | 263.0 | 58 | 58 | 58 | 43 | 15 |
| 18 | b4e3_nCL | C | b4e3_nCL-C | Intensity | b4e3 | 275.0 | 64 | 64 | 64 | 48 | 16 |
| 19 | b4e3_nCL | D | b4e3_nCL-D | Intensity | b4e3 | 263.0 | 88 | 88 | 88 | 56 | 32 |
| 20 | b4e3_254 | A | b4e3_254-A | Intensity | b4e3 | 14110.0 | 1303 | 1303 | 1303 | 901 | 402 |
| 21 | b4e3_254 | B | b4e3_254-B | Intensity | b4e3 | 13672.0 | 1259 | 1259 | 1259 | 887 | 372 |
| 22 | b4e3_254 | C | b4e3_254-C | Intensity | b4e3 | 13632.0 | 1286 | 1286 | 1286 | 901 | 385 |
| 23 | b4e3_254 | D | b4e3_254-D | Intensity | b4e3 | 13798.0 | 1533 | 1533 | 1533 | 974 | 559 |
| 24 | b4e4_nCL | A | b4e4_nCL-A | Intensity | b4e4 | 13312.0 | 834 | 834 | 834 | 341 | 493 |
| 25 | b4e4_nCL | B | b4e4_nCL-B | Intensity | b4e4 | 13628.0 | 858 | 858 | 858 | 347 | 511 |
| 26 | b4e4_nCL | C | b4e4_nCL-C | Intensity | b4e4 | 13209.0 | 847 | 847 | 847 | 348 | 499 |
| 27 | b4e4_nCL | D | b4e4_nCL-D | Intensity | b4e4 | 13925.0 | 1072 | 1072 | 1072 | 413 | 659 |
| 28 | b4e4_254 | B | b4e4_254-B | Intensity | b4e4 | 11277.0 | 995 | 995 | 995 | 672 | 323 |
| 29 | b4e4_254 | C | b4e4_254-C | Intensity | b4e4 | 16097.0 | 1220 | 1220 | 1220 | 790 | 430 |
| 30 | b4e4_254 | D | b4e4_254-D | Intensity | b4e4 | 16280.0 | 1522 | 1522 | 1522 | 902 | 620 |
| 31 | b4e4_254 | F | b4e4_254-F | Intensity | b4e4 | 16250.0 | 1219 | 1219 | 1219 | 786 | 433 |
#### To plot these countsjvis.BarPlotByGroup_sbplot(metaStatsGo, x_col = 'condition', y_col = 'GO0003723', title = 'GO:0003723, RNA-Binding', pal = set2_paired)<Figure size 432x288 with 0 Axes>
#### Calculate the % GO0003723 members per groupmetaStatsGo['% RBP'] = metaStatsGo.apply(lambda row: round(100*row['GO0003723']/(row['GO0003723']+row['GO_None'])), axis = 1)#### Plotjvis.BarPlotByGroup_sbplot(metaStatsGo, x_col = 'condition', y_col = '% RBP', title = 'GO:0003723, RNA-Binding', pal = set2_paired, yrange = [0,100])<Figure size 432x288 with 0 Axes>
- e296_Sequence-Agnostic RNP Purification_redact.ipynb
- e296_Sequence-Agnostic RNP Purification_redact - Copy.ipynb
xxxxxxxxxx200826 Expt.296: Sequence-Agnostic RNP Purification
NB: This demonstration analysis was performed on unpublished development data. The original experiment was designed to investigate the conditions under which RBP-RNA complexes could be purified without hybridisation. The data shown here is only one part of 4 datasets in the experiment (including the protocol which worked)
Aims:
A. To investigate different buffer compositions and elution conditions that may reveal a capacity for the purification of RBP-RNA complexes via selective precipitation.
Conditions
Buffer 3 Elution 3 (b3e3)
Buffer 3 Elution 4 (b3e4)
Buffer 4 Elution 3 (b4e3)
Buffer 4 Elution 4 (b4e4)
xxxxxxxxxx1. Import Modules and Files
Custom Functions
jwrangle.importMixedFiles( )
I generally import everything I MIGHT use at the start and set up pathing using the OS-agnostic pathlib.
#### Import necessary modulesimport osimport pandas as pdfrom pathlib import Pathimport copyimport jwrangleimport jvisimport jinspectimport jtestimport jwebfrom imp import reloadimport upsetplot as upsetfrom Bio import SeqIOimport altair as altimport numpy as npimport preprocessimport seaborn as sns#### define working directoriescwd = Path(os.getcwd())base_path = Path(os.path.join(*cwd.parts[:cwd.parts.index('experiments')]))#### MaxQuant proteinGroups & evidence filesMQ_folder = jwrangle.importMixedFiles(cwd / 'MaxQuant')MQ_folder.keys()pGroups = MQ_folder['proteinGroups.txt']evidence = MQ_folder['evidence.txt']#### Inspect MQ setupMQ_folder['parameters.txt'].head(9)C:\Users\smith.j\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3242: DtypeWarning: Columns (53,54,61) have mixed types.Specify dtype option on import or set low_memory=False. if (await self.run_code(code, result, async_=asy)): C:\Users\smith.j\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3242: DtypeWarning: Columns (321,322) have mixed types.Specify dtype option on import or set low_memory=False. if (await self.run_code(code, result, async_=asy)):
| Parameter | Value | |
|---|---|---|
| 0 | Version | 1.6.7.0 |
| 1 | User name | smith.j |
| 2 | Machine name | MSCYPHER-253 |
| 3 | Date of writing | 08/26/2020 20:00:00 |
| 4 | Include contaminants | True |
| 5 | PSM FDR | 0.01 |
| 6 | PSM FDR Crosslink | 0.01 |
| 7 | Protein FDR | 0.01 |
| 8 | Site FDR | 0.01 |
xxxxxxxxxx2. Metadata Creation
Custom Functions
jwrangle.MQ_writeMetadata( )
Metadata tabulates the test conditions for ALL experiments that shared the same MQ search and thuss all experiments that comprise the MQ outputs. Metadata can also be done in a spreadsheet program. Here, I have created my metadata programmatically instead (I simply find this easier).
The metadata table gives users the opportunity to rename samples and define the experimental parameters for the data. This task can be expecially complex for MaxQuant because a unified output is generated even if distinctly separate experiments are searched as a batch and with different parameters applied.
The function jwrangle.MQ_writeMetadata( ) will take a metadata table, rename all samples in the proteinGroups and evidence files, assign alternative filenames, and save new copies to be used in future analyses.
xxxxxxxxxx### Metadata and relabeling done previouslyxxxxxxxxxx3. Re-Annotate the MaxQuant pGroups with stable Gene IDs
Functions
jweb.mapAnyID( )
jwrangle.importMixedFiles( )
MaxQuant does a good job of assigning a Gene name to each protein group. Presumably these gene names come from the FASTA. However:
- Sometimes it fails to find a gene name
- Sometimes it will assign an ID that is not a gene or include out-of-place characters
- It doesn't always seem to be consistent
- If the gene name originates from the FASTA then repeating the MQ search with an updated FASTA is the only way to update the gene IDs.
- Use of a mapping service will standardise the ID conversion practices between my datasets and those of others, including RNA-Seq.
To avoid these problems we will remap the Majority protein IDs to ENTREZ gene IDs. jweb.mapAnyID( ) will retrieve all possible genes for each protein group, and will also select a primary ID to singularly represent the group by a consistent method. This is a very flexible function, see help( ) for further explanation. From this point, the MQ 'Gene names' column will no longer be necessary. This function can also handle ID mapping to and from almost any convention.
Ensuring our proteins have a consistent gene naming strategy is essential for inter-experiment comparison and the later use of set methods. It also creates a standard that can be applied for accurately mapping RNA-Seq results and thus aid in future mapping of protein-RNA partners.
#### If not already loaded, read in the metadata-adjusted filesmetadata = pd.read_csv(cwd / 'metadata' / 'e296r_metadata.csv', index_col = 0)pGroups = pd.read_csv(cwd / 'MaxQuant' / 'e296r_proteinGroups_metalabeled.txt', delimiter = '\t')evidence = pd.read_csv(cwd / 'MaxQuant' / 'e296r_evidence_metalabeled.txt', delimiter = '\t')C:\Users\smith.j\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3051: DtypeWarning: Columns (321,322) have mixed types.Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result) C:\Users\smith.j\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3051: DtypeWarning: Columns (53,54,61) have mixed types.Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)
#### Dynamically remap gene names in our proteinGroups file and save a copypGroups_remap = jweb.mapAnyID_gPro(pGroups['Majority protein IDs'].tolist(), splitstr = [';', '-'], geneProductType = 'protein', gConvertOrganism = 'hsapiens', gConvertTarget = 'ENTREZGENE', writetopath = [cwd, 'pGroups_remap'], writeTargetsAsList = 'NO')#### If not already loaded, read in the remapped proteinGroups filepGroups_remap = jwrangle.importMixedFiles(cwd / 'downloads' / 'pGroups_remap', dropSuffix = 'yes')pGroups_remap.keys()dict_keys(['id_map', 'query_map'])
#### jwrangle.importMixedFiles() returns a dictionary where keys = files. We want the 'id_map' table created by jweb.mapAnyID_gPro().#### We'll rename the Query column and drop duplicates so the table can be merged with our proteinGroups table.id_map = pGroups_remap['id_map'].rename(columns={'Query': 'Majority protein IDs'}).drop_duplicates()#### Now use merge to add these new columns to our proteinGroups tablepGroups_map = pd.merge(pGroups, id_map, on='Majority protein IDs', how='left')#### Check the tables are merged by viewing column elements from each.pGroups_map[id_map.columns.tolist() + ['Peptide IDs']].head(2)| Majority protein IDs | ENTREZGENE_gPro all | ENTREZGENE_gPro primary | ENTREZGENE_gPro name | UNIPROT_gPro status | Peptide IDs | |
|---|---|---|---|---|---|---|
| 0 | A0A024R4E5;Q00341;Q00341-2;H0Y394 | HDLBP;nan | HDLBP | high density lipoprotein binding protein [Sour... | SWISSPROT | 57;711;1649;1671;1940;2102;2775;2824;3106;3440... |
| 1 | A0A024R4M0;P46781;B5MCT8;C9JM19 | RPS9 | RPS9 | ribosomal protein S9 [Source:HGNC Symbol;Acc:H... | SWISSPROT | 6085;6086;10517;11131;11450;13657;14430;14702;... |
xxxxxxxxxx4. Review Contaminants by Sample
Functions
jinspect.MQ_getContaminants( )
MQ_getContaminants_sbplot( )
jwrangle.importMixedFiles( )
We can extract the conaminants from our proteinGroups file using jinspect.MQ_getContaminants( ). These extracted table will return log2(iBAQ values).
Contaminants can then be reviewed with MQ_getContaminants_sbplot( ).
#### Extract contaminantscontaminants = jinspect.MQ_getContaminants(pGroups_map, metadata)#### Sort the metadata into a more intuitive ordermetadata_sort = metadata.sort_values(by = ['MQgroups','condition', 'replicate'], ascending=[True, False, True])#### Visually inspect contaminants jvis.MQ_getContaminants_sbplot(contaminants, metadata_sort, width = 3, length = 2, layout = 'single')<Figure size 432x288 with 0 Axes>
xxxxxxxxxx5. Assess Digestion Efficiency
Functions
jinspect.MQ_getMissedCleavages( )
jvis.CommonPalettesAsHex
jvis.BarPlotByGroup_sbplot( )
Assessing missed cleavages is an essential metric for understanding the quality of the tryptic digestion. This data is recorded in the evidence file.
jinspect.MQ_getMissedCleavages( )_ will return a long form data table that can easily be used for plotting. The dictionary jvis.CommonPalettesAsHex contains a number of palettes that are common to both matplotlib and ggplot (from R). These are provided to ensure consistency is easy to achieve across both languages. We'll plot the missed cleavages with the generic function jvis.BarPlotByGroup_sbplot( )
#### Extract the missed cleavage data into a long form table for plottingMissedCleavages = jinspect.MQ_getMissedCleavages(evidence, metadata, drop_contaminants = True)MissedCleavages.sort_values(by=['expt','sample'], inplace = True)MissedCleavages| sample | % Missed Cleavages | group | expt | |
|---|---|---|---|---|
| 10 | b3e3_254-A | 24 | b3e3_254 | b3e3 |
| 5 | b3e3_254-B | 24 | b3e3_254 | b3e3 |
| 14 | b3e3_254-C | 24 | b3e3_254 | b3e3 |
| 15 | b3e3_254-D | 24 | b3e3_254 | b3e3 |
| 12 | b3e3_nCL-A | 22 | b3e3_nCL | b3e3 |
| 16 | b3e3_nCL-B | 21 | b3e3_nCL | b3e3 |
| 20 | b3e3_nCL-C | 16 | b3e3_nCL | b3e3 |
| 31 | b3e3_nCL-D | 17 | b3e3_nCL | b3e3 |
| 21 | b3e4_254-A | 31 | b3e4_254 | b3e4 |
| 27 | b3e4_254-B | 32 | b3e4_254 | b3e4 |
| 23 | b3e4_254-C | 32 | b3e4_254 | b3e4 |
| 19 | b3e4_254-D | 32 | b3e4_254 | b3e4 |
| 3 | b3e4_nCL-A | 35 | b3e4_nCL | b3e4 |
| 28 | b3e4_nCL-B | 34 | b3e4_nCL | b3e4 |
| 2 | b3e4_nCL-C | 34 | b3e4_nCL | b3e4 |
| 11 | b3e4_nCL-D | 34 | b3e4_nCL | b3e4 |
| 29 | b4e3_254-A | 25 | b4e3_254 | b4e3 |
| 7 | b4e3_254-B | 25 | b4e3_254 | b4e3 |
| 13 | b4e3_254-C | 25 | b4e3_254 | b4e3 |
| 8 | b4e3_254-D | 26 | b4e3_254 | b4e3 |
| 26 | b4e3_nCL-A | 18 | b4e3_nCL | b4e3 |
| 0 | b4e3_nCL-B | 18 | b4e3_nCL | b4e3 |
| 9 | b4e3_nCL-C | 20 | b4e3_nCL | b4e3 |
| 6 | b4e3_nCL-D | 19 | b4e3_nCL | b4e3 |
| 30 | b4e4_254-B | 22 | b4e4_254 | b4e4 |
| 25 | b4e4_254-C | 33 | b4e4_254 | b4e4 |
| 17 | b4e4_254-D | 33 | b4e4_254 | b4e4 |
| 18 | b4e4_254-F | 32 | b4e4_254 | b4e4 |
| 1 | b4e4_nCL-A | 35 | b4e4_nCL | b4e4 |
| 24 | b4e4_nCL-B | 35 | b4e4_nCL | b4e4 |
| 22 | b4e4_nCL-C | 36 | b4e4_nCL | b4e4 |
| 4 | b4e4_nCL-D | 35 | b4e4_nCL | b4e4 |
### Select a colour palette cpal = jvis.CommonPalettesAsHexset2_paired = []for i in cpal['Set2_qual']: set2_paired.append(i) set2_paired.append(i)#### Plot the grouped data points sns.set_style('whitegrid')jvis.BarPlotByGroup_sbplot(MissedCleavages, x_col = 'group', y_col = '% Missed Cleavages', title = '% Missed Cleavages', pal = set2_paired)<Figure size 432x288 with 0 Axes>
xxxxxxxxxxResults: Average
Digestion efficiency isn't great. Let's reduce the digestion volume by 0.5x and maintain the same trypsin concentration in future experiments
xxxxxxxxxx6. Remove Contaminants
Functions
jwrangle.MQ_getThreePassFilter( )
SeqIO.parse( )
After QC we no longer want the contaminants in our data. jwrangle.MQ_getThreePassFilter( ) will remove reverse peptides, contaminants, and only identified by site from MQ tables.
The filter will also accept customised exclusion lists in case users have added odd protein species to the search FASTA tables. In this particular experiment we added to the human FASTA, RNAse proteins and the large T antigen. The former as 1) a check that dynamic range is not being overwhelmed and 2) as an quantitative spike-in control to compare tryptic efficiency and the sample recovery across samples following C18 cleanup.
#### Map the location of the custom FASTA elementsos.listdir(base_path / 'my_resources' / 'FASTA') #### Create a list of the non-human proteins that were added to the custom FASTA genome search. new_cont = []with open(base_path / 'my_resources' / 'FASTA' / "custom_proteome_elements.fasta", "r") as handle: for record in SeqIO.parse(handle, "fasta"): new_cont.append(record.id.split('|')[1])#### Remove all unwanted contaminants and IDs from the proteinGroups table pGroup_clean = jwrangle.MQ_getThreePassFilter(pGroups_map, custom_exclusion = new_cont)#### Inspect the cleaned dataframepGroup_clean[['ENTREZGENE_gPro primary'] + [i for i in pGroup_clean.columns if 'iBAQ' in i]].head(2)| ENTREZGENE_gPro primary | iBAQ | iBAQ b3e3_nCL-A | iBAQ b3e3_nCL-B | iBAQ b3e3_nCL-C | iBAQ b3e3_nCL-D | iBAQ b3e3_254-A | iBAQ b3e3_254-B | iBAQ b3e3_254-C | iBAQ b3e3_254-D | ... | iBAQ b3e4_254-C | iBAQ b3e4_254-D | iBAQ b4e4_nCL-A | iBAQ b4e4_nCL-B | iBAQ b4e4_nCL-C | iBAQ b4e4_nCL-D | iBAQ b4e4_254-B | iBAQ b4e4_254-F | iBAQ b4e4_254-C | iBAQ b4e4_254-D | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HDLBP | 1.044400e+09 | 0.0 | 0.0 | 0.0 | 0.0 | 9.678600e+07 | 7.080900e+07 | 9.100200e+07 | 9.025000e+07 | ... | 60376000.0 | 50760000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3174900.0 | 57683000.0 | 82764000.0 | 75414000.0 |
| 1 | RPS9 | 1.689600e+10 | 0.0 | 0.0 | 0.0 | 0.0 | 1.421100e+09 | 1.117600e+09 | 1.870200e+09 | 2.090800e+09 | ... | 399610000.0 | 324730000.0 | 993560.0 | 1235400.0 | 0.0 | 0.0 | 36901000.0 | 420100000.0 | 579340000.0 | 495440000.0 |
2 rows × 34 columns
xxxxxxxxxx7. Drop Gene Duplicates and Filter Intensities by LFQ
Functions
jinspect.MQ_dropDuplicateIDs( )
The next step focuses on improving confidence in the quality of our data. This is done by applying jinspect.MQ_dropDuplicateIDs( ) which has the below effects:
- Because one gene can have many proteins, sometimes Maxquant will create multiple proteinGroups for a single gene. As most of our analysis focuses on genes we'll trim the lowest quality proteinGroups duplicates from the table.
- Standard LFQ defaults require a minimum of 2 peptide species, at least one of which must be unique, for quantitation to be applied. Intensity and iBAQ values, however, do not have such a minimum limit. I consider a 2 peptide minimum to be a wise filter but still have use for the Intensity and iBAQ values. Thus where the LFQ filter is applied all measurements that do not meet the minimum limit will be discarded. In short, if there isn't a companion LFQ value, there won't be an Intensity or iBAQ value either after filtering.
- It has been documented that Match Between Runs suffers a high frequency of false peptide transfers (Lim, Paulo, Gygi 2019; doi: 10.1021/acs.jproteome.9b00492). At the protein level, however, this false transfer rate is greatly mitigated by the minimum peptide rule applied by the LFQ algorithm. This is another good reason for our filtering step.
#### Drop duplicates and apply LFQ filterfilter_dict = jinspect.MQ_dropDuplicateIDs(pGroup_clean, metadata, prefix = 'Peptides', ID = 'ENTREZGENE_gPro primary', pool = 'measure', drop_ID = 'None', keep_PoolCalcs = False, applyLFQ_filter = ['Intensity', 'iBAQ'])#### The df_keep value contains our targets, df_droprows conatins the discarded duplicates. Assign the df_keep value to a new variable and inspect.pGroup_filtered = filter_dict['df_keep']pGroup_filtered.head(2)WARNING: jinspect.MQ_getMeasuredMeansByGroup() has not been checked
| Protein IDs | Majority protein IDs | Peptide counts (all) | Peptide counts (razor+unique) | Peptide counts (unique) | Protein names | Gene names | Fasta headers | Number of proteins | Peptides | ... | iBAQ b4e3_nCL-C | iBAQ b4e3_nCL-D | iBAQ b4e4_254-B | iBAQ b4e4_254-C | iBAQ b4e4_254-D | iBAQ b4e4_254-F | iBAQ b4e4_nCL-A | iBAQ b4e4_nCL-B | iBAQ b4e4_nCL-C | iBAQ b4e4_nCL-D | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A0A024R4E5;Q00341;Q00341-2;H0Y394;H7C0A4;C9JIZ... | A0A024R4E5;Q00341;Q00341-2;H0Y394 | 56;55;50;39;18;15;13;11;10;10;10;10;9;6;6;5;5;... | 56;55;50;39;18;15;13;11;10;10;10;10;9;6;6;5;5;... | 56;55;50;39;18;15;13;11;10;10;10;10;9;6;6;5;5;... | Vigilin | HDLBP | ;;; | 23 | 56 | ... | 0.0 | 0.0 | 3174900.0 | 82764000.0 | 75414000.0 | 57683000.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | A0A024R4M0;P46781;B5MCT8;C9JM19;F2Z3C0;A8MXK4 | A0A024R4M0;P46781;B5MCT8;C9JM19 | 16;16;13;13;4;4 | 16;16;13;13;4;4 | 16;16;13;13;4;4 | 40S ribosomal protein S9 | RPS9 | ;;; | 6 | 16 | ... | 0.0 | 0.0 | 36901000.0 | 579340000.0 | 495440000.0 | 420100000.0 | 0.0 | 1235400.0 | 0.0 | 0.0 |
2 rows × 336 columns
xxxxxxxxxx8. Review Sample Clustering by Group
Functions
jtest.getDistanceMatrix( )
jvis.MQ_showDendrogramQC_mplplot( )
A distance matrix function jtest.getDistanceMatrix( ) is provided for users who wish to apply different algorithms or create different visualisations.
I like the 'ward' method for distance calculations and using a dengrogram to confirm that clustering matches expectations and so use a prerolled function jvis.MQ_showDendrogramQC_mplplot( )
#### Confirm that clustering matches expectationsjvis.MQ_showDendrogramQC_mplplot(pGroup_filtered, 'Intensity', metadata, 'QC clustering: ', grid = 'YES', fsize = (8, 8))xxxxxxxxxx9. Analyse Normalisation Effects by Sample
Functions
jwrangle.Log2_ByPrefix( )
jwrangle.MQ_poolMulti( )
jvis.ViolinCompare_sbplot( )
Here we review normalisation effects on each sample within the condition groups; these are most easily interpreted after log2 transformation. We will transform all measures of interest with jwrangle.Log2_ByPrefix( ) and then pool all the values of interest, by condition, with jwrangle.MQ_poolMulti( ). The function jvis.ViolinCompare_sbplot( ) will let use compare Intensity distribution on a per sample basis.
Normalisation is applied to LFQ values by MaxQuant and is a feature of its handling of label-free data. I've not seen a detailed explanation of how it works though so it is a leap of faith that Cox and Mann have selected an appropriate method.
Normalisation must be applied separately to nCL and cCL groups. This is unusual though necessary to avoid outrageous results caused by having groups with extreme differences. See expt.313 for evidence.
#### Log2 transform available intensity values.pGroup_log2 = jwrangle.Log2_ByPrefix(pGroup_filtered, 'LFQ intensity')pGroup_log2 = jwrangle.Log2_ByPrefix(pGroup_log2, 'iBAQ')pGroup_log2 = jwrangle.Log2_ByPrefix(pGroup_log2, 'Intensity')pGroup_log2.replace(0,np.nan, inplace=True)#### Create a long form dataset for each desired groupingpool_SampleIntensity = jwrangle.MQ_poolMulti(pGroup_log2, metadata, melt_list = ['Intensity', 'LFQ intensity'], group = 'condition')pool_SampleIntensity.keys()dict_keys(['b3e3_nCL', 'b3e3_254', 'b4e3_nCL', 'b4e3_254', 'b3e4_nCL', 'b3e4_254', 'b4e4_nCL', 'b4e4_254'])
#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are #### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violationssns.set_style('whitegrid')jvis.ViolinCompare_sbplot(pool_SampleIntensity['b3e3_nCL'], title = 'b3e3_nCL: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = ['#ff6666', '#99ccff'])#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are #### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violationssns.set_style('whitegrid')jvis.ViolinCompare_sbplot(pool_SampleIntensity['b3e3_254'], title = 'b3e3_254: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = ['#ff6666', '#99ccff'])#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are #### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violationssns.set_style('whitegrid')jvis.ViolinCompare_sbplot(pool_SampleIntensity['b3e4_nCL'], title = 'b3e4_nCL: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = cpal['Set3_qual'])#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are #### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violationssns.set_style('whitegrid')jvis.ViolinCompare_sbplot(pool_SampleIntensity['b3e4_254'], title = 'b3e4_254: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = cpal['Set3_qual'])#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are #### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violationssns.set_style('whitegrid')jvis.ViolinCompare_sbplot(pool_SampleIntensity['b4e3_nCL'], title = 'b4e3_nCL: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = cpal['Set3_qual'][4:])#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are #### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violationssns.set_style('whitegrid')jvis.ViolinCompare_sbplot(pool_SampleIntensity['b4e3_254'], title = 'b4e3_254: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = cpal['Set3_qual'][4:])#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are #### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violationssns.set_style('whitegrid')jvis.ViolinCompare_sbplot(pool_SampleIntensity['b4e4_nCL'], title = 'b4e4_nCL: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = cpal['Set1_qual'][2:])#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are #### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violationssns.set_style('whitegrid')jvis.ViolinCompare_sbplot(pool_SampleIntensity['b4e4_nCL'], title = 'b4e4_nCL: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = cpal['Set1_qual'][2:])xxxxxxxxxx10. Compare Intensity Distribution and Sequence Coverage
Functions
jwrangle.MQ_poolDataByCondition( )
jvis.BoxPlotByColumn_sbplot( )
Next we will compare intensity and sequence coverage between groups. Log2 transformation has already been performed so we need only use jwrangle.MQ_poolDataByCondition( ) to create the appropriate long form dataset for plotting with jvis.BoxPlotByColumn_sbplot( ).
#### Pool data into a single long form datasetpooled_dfDropGroupOne = jwrangle.MQ_poolDataByCondition(pGroup_log2, metadata_sort, prefix_list = ['Intensity', 'Sequence coverage'])#### Compare Intensity distribution using a box and whisker plotsns.set_style('whitegrid')jvis.BoxPlotByColumn_sbplot(pooled_dfDropGroupOne, 'Intensity: ', 'Intensity')#### Compare Sequence coverage using a box and whisker plotsns.set_style('whitegrid')jvis.BoxPlotByColumn_sbplot(pooled_dfDropGroupOne, 'Sequence coverage: ', 'Sequence coverage %')xxxxxxxxxx11. Compare Sum Peptide Counts
Functions
jinspect.MQ_getSumBySample( )
jvis.BarPlotByGroup_sbplot( )
To sum the total peptides observed across all proteins use jinspect.MQ_getSumBySample( ). These sums will be returned as a modified metadata table.
Plotting these by group is easily done with jvis.BarPlotByGroup_sbplot( ). The plotting order is determined by the metadata ordering.
In this case we are inspecting the number of peptides detected after having removed contaminants- thus if some spike-in proteins were removed, i.e. in this case RNAse treatments, they will not contribute to the peptide count. To look at the replicability of these spike-ins, we would reach back to the 'df_droprows' table generated by jinspect.MQ_dropDuplicateIDs( ) in section 7.
#### Extract the total peptides observed per samplemetaStats = jinspect.MQ_getSumBySample(pGroup_log2, metadata_sort, freqList = ['Peptides'], measure = False)#### Plot the sum peptidessns.set_style('whitegrid')jvis.BarPlotByGroup_sbplot(metaStats, x_col = 'condition', y_col = 'Peptides', title = 'Sum Peptides vs pH enrichment', pal = set2_paired, errorbars = 'SEM')<Figure size 432x288 with 0 Axes>
xxxxxxxxxx12. Compare Unique Gene Counts
Functions
jinspect.MQ_getFrequencyBySample( )
One gene can encode for many proteins that often share regions of similarity. As for illumina-based RNA-Seq, however, shotgun proteomics can rarely assign a peptide species to a singular protein. In MaxQuant these are called proteinGroups. Because we have do not require protein-specific results, and gene identity is more stable, our gene count describes the groups to which our detected proteins have been be assigned. Thus gene here is being detected by protein product, just as it would be detected by RNA product in RNA Seq; none of these 3 are synonymous. To be clear, this is a count and not a measure.
Gene frequency is defined by the summed observations per protein regardless of intensity value and this data is extracted to our modified metadata with jinspect.MQ_getFrequencyBySample( ) .
A typical MQ search will yield identical protein counts (though different values) for Intensity and iBAQ*. LFQ frequencies will vary depending on the search settings:
- In this case the MQ search has set LFQ values to be calculated on a min 2 peptide ratio (this is the default)**
Notes
* Why protein counts should be identical I don't know. The original iBAQ paper stipulates rules for the inclusion of a protein in the iBAQ calculation but MaxQuant doesn't seem to apply them.
** Previously I tested LFQ min ratio at 1 peptide. At 1 minimum peptide there was unexpected QC clustering. Possible explanations for this are explained in section 7 and are cleaned up by jinspect.MQ_dropDuplicateIDs( ) function. We can expect this function to greatly reduce qualifying IDs (~20% fewer), especially in the QE samples, but I think the trade-off is worth it because we gain 1) a more robust ID check and 2) the same search can be used for LFQ based checks of dynamic changes, i.e. comparing more than one group of cCL captures for biological changes.
#### Count the number of unique metaStats = jinspect.MQ_getFrequencyBySample(pGroup_log2, metaStats, freqList = ['Intensity', 'iBAQ', 'LFQ intensity'], measure = False)metaStats.iloc[:,1:]| condition | replicate | sample | measure | MQgroups | Peptides | Intensity | iBAQ | LFQ intensity | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | b3e3_nCL | A | b3e3_nCL-A | Intensity | b3e3 | 180.0 | 31 | 31 | 31 |
| 1 | b3e3_nCL | B | b3e3_nCL-B | Intensity | b3e3 | 156.0 | 32 | 32 | 32 |
| 2 | b3e3_nCL | C | b3e3_nCL-C | Intensity | b3e3 | 155.0 | 50 | 50 | 50 |
| 3 | b3e3_nCL | D | b3e3_nCL-D | Intensity | b3e3 | 125.0 | 51 | 51 | 51 |
| 4 | b3e3_254 | A | b3e3_254-A | Intensity | b3e3 | 13799.0 | 1255 | 1255 | 1255 |
| 5 | b3e3_254 | B | b3e3_254-B | Intensity | b3e3 | 13735.0 | 1253 | 1253 | 1253 |
| 6 | b3e3_254 | C | b3e3_254-C | Intensity | b3e3 | 13854.0 | 1285 | 1285 | 1285 |
| 7 | b3e3_254 | D | b3e3_254-D | Intensity | b3e3 | 13928.0 | 1528 | 1528 | 1528 |
| 8 | b3e4_nCL | A | b3e4_nCL-A | Intensity | b3e4 | 9859.0 | 697 | 697 | 697 |
| 9 | b3e4_nCL | B | b3e4_nCL-B | Intensity | b3e4 | 11362.0 | 751 | 751 | 751 |
| 10 | b3e4_nCL | C | b3e4_nCL-C | Intensity | b3e4 | 11206.0 | 758 | 758 | 758 |
| 11 | b3e4_nCL | D | b3e4_nCL-D | Intensity | b3e4 | 11832.0 | 961 | 961 | 961 |
| 12 | b3e4_254 | A | b3e4_254-A | Intensity | b3e4 | 15220.0 | 1177 | 1177 | 1177 |
| 13 | b3e4_254 | B | b3e4_254-B | Intensity | b3e4 | 15259.0 | 1193 | 1193 | 1193 |
| 14 | b3e4_254 | C | b3e4_254-C | Intensity | b3e4 | 15127.0 | 1209 | 1209 | 1209 |
| 15 | b3e4_254 | D | b3e4_254-D | Intensity | b3e4 | 15312.0 | 1471 | 1471 | 1471 |
| 16 | b4e3_nCL | A | b4e3_nCL-A | Intensity | b4e3 | 261.0 | 54 | 54 | 54 |
| 17 | b4e3_nCL | B | b4e3_nCL-B | Intensity | b4e3 | 263.0 | 58 | 58 | 58 |
| 18 | b4e3_nCL | C | b4e3_nCL-C | Intensity | b4e3 | 275.0 | 64 | 64 | 64 |
| 19 | b4e3_nCL | D | b4e3_nCL-D | Intensity | b4e3 | 263.0 | 88 | 88 | 88 |
| 20 | b4e3_254 | A | b4e3_254-A | Intensity | b4e3 | 14110.0 | 1303 | 1303 | 1303 |
| 21 | b4e3_254 | B | b4e3_254-B | Intensity | b4e3 | 13672.0 | 1259 | 1259 | 1259 |
| 22 | b4e3_254 | C | b4e3_254-C | Intensity | b4e3 | 13632.0 | 1286 | 1286 | 1286 |
| 23 | b4e3_254 | D | b4e3_254-D | Intensity | b4e3 | 13798.0 | 1533 | 1533 | 1533 |
| 24 | b4e4_nCL | A | b4e4_nCL-A | Intensity | b4e4 | 13312.0 | 834 | 834 | 834 |
| 25 | b4e4_nCL | B | b4e4_nCL-B | Intensity | b4e4 | 13628.0 | 858 | 858 | 858 |
| 26 | b4e4_nCL | C | b4e4_nCL-C | Intensity | b4e4 | 13209.0 | 847 | 847 | 847 |
| 27 | b4e4_nCL | D | b4e4_nCL-D | Intensity | b4e4 | 13925.0 | 1072 | 1072 | 1072 |
| 28 | b4e4_254 | B | b4e4_254-B | Intensity | b4e4 | 11277.0 | 995 | 995 | 995 |
| 29 | b4e4_254 | C | b4e4_254-C | Intensity | b4e4 | 16097.0 | 1220 | 1220 | 1220 |
| 30 | b4e4_254 | D | b4e4_254-D | Intensity | b4e4 | 16280.0 | 1522 | 1522 | 1522 |
| 31 | b4e4_254 | F | b4e4_254-F | Intensity | b4e4 | 16250.0 | 1219 | 1219 | 1219 |
#### Plot the countssns.set_style('whitegrid')jvis.BarPlotByGroup_sbplot(metaStats, x_col = 'condition', y_col = 'Intensity', title = '# Genes Detected By Group', pal = set2_paired, ylabel = 'Unique Genes', errorbars = 'SEM')<Figure size 432x288 with 0 Axes>
xxxxxxxxxx13. Assess Replicate Correlation
Functions
jwrangle.MQ_getSliceByPrefix( )
jvis.showPearsonRegression_altair( )
The function jwrangle.MQ_getSliceByPrefix( ) provides a convenient means of extracting values of a specific group.
We can then use jvis.showPearsonRegression_altair( ) to perform pairwise comparisons between each member of those groups. This function is specifically applied to genes with shared intensities- genes exclusive to one sample or the other, represented by vertical or horizontal datapoints, are plotted but excluded from the pearson calculation.
#### Extract the intensity values as a dictionary where keys = groupsIntensity_Dict = jwrangle.MQ_getSliceByPrefix(pGroup_log2, metadata, 'Intensity', group = 'condition', add_col = None)Intensity_Dict.keys()dict_keys(['b3e3_nCL', 'b3e3_254', 'b4e3_nCL', 'b4e3_254', 'b3e4_nCL', 'b3e4_254', 'b4e4_nCL', 'b4e4_254'])
#### Check replicate consistency across all within group pairsjvis.showPearsonRegression_altair(Intensity_Dict['b3e3_nCL'], mark_color = set2_paired[2])jvis.showPearsonRegression_altair(Intensity_Dict['b3e3_254'], mark_color = set2_paired[2])jvis.showPearsonRegression_altair(Intensity_Dict['b4e3_nCL'], mark_color = set2_paired[4])jvis.showPearsonRegression_altair(Intensity_Dict['b4e3_254'], mark_color = set2_paired[4])jvis.showPearsonRegression_altair(Intensity_Dict['b3e4_nCL'], mark_color = set2_paired[6])jvis.showPearsonRegression_altair(Intensity_Dict['b3e4_254'], mark_color = set2_paired[6])jvis.showPearsonRegression_altair(Intensity_Dict['b4e4_nCL'], mark_color = set2_paired[8])jvis.showPearsonRegression_altair(Intensity_Dict['b4e4_254'], mark_color = set2_paired[8])xxxxxxxxxx14. Classify RBP
Functions
jinspect.MQ_getFrequencyByGroup()
jtest.MQ_applyClassifyRBP()
Before classifying our RBP we need to first tally the frequency with which each protein appears in each condition using jinspect.MQ_getFrequencyByGroup( )
Once done, we use jtest.MQ_applyClassifyRBP( ) to generate a dictionary from which each class can be reviewed or plotted.
#### Use the metadata and proteinGroups tables to count how many times a gene is identified in its group (/6). #### Here I demonstrate how we can count for all instances of Intensity, iBAQ and LFQ IntensitypGroup_Freq = jinspect.MQ_getFrequencyByGroup(pGroup_log2, metadata, 'iBAQ', group = 'condition')pGroup_Freq = jinspect.MQ_getFrequencyByGroup(pGroup_Freq, metadata, 'LFQ intensity', group = 'condition')pGroup_Freq = jinspect.MQ_getFrequencyByGroup(pGroup_Freq, metadata, 'Intensity', group = 'condition')pGroup_Freq[['ENTREZGENE_gPro primary'] + [i for i in pGroup_Freq.columns if 'Freq' in i]].head(2)| ENTREZGENE_gPro primary | iBAQ Freq: b3e3_nCL | iBAQ Freq: b3e3_254 | iBAQ Freq: b4e3_nCL | iBAQ Freq: b4e3_254 | iBAQ Freq: b3e4_nCL | iBAQ Freq: b3e4_254 | iBAQ Freq: b4e4_nCL | iBAQ Freq: b4e4_254 | LFQ intensity Freq: b3e3_nCL | ... | LFQ intensity Freq: b4e4_nCL | LFQ intensity Freq: b4e4_254 | Intensity Freq: b3e3_nCL | Intensity Freq: b3e3_254 | Intensity Freq: b4e3_nCL | Intensity Freq: b4e3_254 | Intensity Freq: b3e4_nCL | Intensity Freq: b3e4_254 | Intensity Freq: b4e4_nCL | Intensity Freq: b4e4_254 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HDLBP | 0 | 4 | 0 | 4 | 0 | 4 | 0 | 4 | 0 | ... | 0 | 4 | 0 | 4 | 0 | 4 | 0 | 4 | 0 | 4 |
| 1 | RPS9 | 0 | 4 | 0 | 4 | 4 | 4 | 1 | 4 | 0 | ... | 1 | 4 | 0 | 4 | 0 | 4 | 4 | 4 | 1 | 4 |
2 rows × 25 columns
#### Now we can classify RBP; the console will report if all proteins/genes have been properly classified or not.RBP_Dict_b3e3 = jtest.MQ_applyClassifyRBP(pGroup_Freq, 'LFQ intensity', metadata, 'b3e3', 'b3e3_nCL', 'b3e3_254', 3, add_cols = ['ENTREZGENE_gPro all', 'ENTREZGENE_gPro primary', 'ENTREZGENE_gPro name'])All proteins classified = True
#### Now we can classify RBP; the console will report if all proteins/genes have been properly classified or not.RBP_Dict_b3e4 = jtest.MQ_applyClassifyRBP(pGroup_Freq, 'LFQ intensity', metadata, 'b3e4', 'b3e4_nCL', 'b3e4_254', 3, add_cols = ['ENTREZGENE_gPro all', 'ENTREZGENE_gPro primary', 'ENTREZGENE_gPro name'])All proteins classified = True
#### Now we can classify RBP; the console will report if all proteins/genes have been properly classified or not.RBP_Dict_b4e3 = jtest.MQ_applyClassifyRBP(pGroup_Freq, 'LFQ intensity', metadata, 'b4e3', 'b4e3_nCL', 'b4e3_254', 3, add_cols = ['ENTREZGENE_gPro all', 'ENTREZGENE_gPro primary', 'ENTREZGENE_gPro name'])All proteins classified = True
#### Now we can classify RBP; the console will report if all proteins/genes have been properly classified or not.RBP_Dict_b4e4 = jtest.MQ_applyClassifyRBP(pGroup_Freq, 'LFQ intensity', metadata, 'b4e4', 'b4e4_nCL', 'b4e4_254', 3, add_cols = ['ENTREZGENE_gPro all', 'ENTREZGENE_gPro primary', 'ENTREZGENE_gPro name'])All proteins classified = True
#### The results of this classifcation can be found in a dictionary of dataframes for each output#### See help() for an explanation of each datasetRBP_Dict_b3e3.keys()dict_keys(['Input_df_annStatus', 'Summary_df_annStatus', 'I', 'IIa', 'IIb', 'IIc', 'NC', 'ND'])
#### We want the most general overview which of our classes per treatment and so concatenate the results from each Summary_df_annStatusRBP_Class = pd.concat([RBP_Dict_b3e3['Summary_df_annStatus'], RBP_Dict_b3e4['Summary_df_annStatus'], RBP_Dict_b4e3['Summary_df_annStatus'], RBP_Dict_b4e4['Summary_df_annStatus']], axis=0, join='outer')#### And represent them in a barplotfrom matplotlib import pyplot as pltplt.figure('rbp class')sns.set_style('whitegrid')sns.set(font_scale=1.1)ax = sns.countplot(x='MQgroup', hue = 'RBP Class', data=RBP_Class.sort_values(by = ['MQgroup', 'RBP Class']), palette = cpal['RBP_Class'], edgecolor = 'black')ax.set_ylabel('Unique Gene Count')ax.set_title('Unique Genes detected per RBP Class')RBP_Class.head(1)| ENTREZGENE_gPro all | ENTREZGENE_gPro primary | ENTREZGENE_gPro name | Gene names | RBP Class | RBP subClass | MQgroup | |
|---|---|---|---|---|---|---|---|
| 0 | HDLBP;nan | HDLBP | high density lipoprotein binding protein [Sour... | HDLBP | I | b3e3 |
#### Because each subclass of the class II RBP are identified based on different statistical assumptions we can more closely inspect if those assumptions trend differently among the different treatments.plt.figure('rbp class')sns.set_style('whitegrid')sns.set(font_scale=1.1)ax = sns.countplot(x='MQgroup', hue = 'RBP subClass', data=RBP_Class[RBP_Class['RBP subClass']!=''].sort_values(by=['MQgroup','RBP subClass']), palette = cpal['Set3_qual'], edgecolor = 'black')ax.set_ylabel('Unique Gene Count')ax.set_title('Unique Genes detected per RBP Class')RBP_Class.head(1)| ENTREZGENE_gPro all | ENTREZGENE_gPro primary | ENTREZGENE_gPro name | Gene names | RBP Class | RBP subClass | MQgroup | |
|---|---|---|---|---|---|---|---|
| 0 | HDLBP;nan | HDLBP | high density lipoprotein binding protein [Sour... | HDLBP | I | b3e3 |
x
## 15. Compare RBP Identity Between Conditions B3 and B4 Treatments Here we use the dictionaries output by __jtest.MQ_applyClassifyRBP( )__ for the purposes of creating a Venn Diagram. 15. Compare RBP Identity Between Conditions B3 and B4 Treatments
Here we use the dictionaries output by jtest.MQ_applyClassifyRBP( ) for the purposes of creating a Venn Diagram.
from matplotlib import pyplot as pltimport numpy as npfrom matplotlib_venn import venn3, venn3_circlesplt.figure(figsize=(6,6))v = venn3([set(RBP_Dict_b3e4['NC']['ENTREZGENE_gPro primary']), set(RBP_Dict_b3e3['I']['ENTREZGENE_gPro primary']), set(RBP_Dict_b4e3['I']['ENTREZGENE_gPro primary'])], set_labels = ('b3e4', 'b3e3', 'b4e3'), alpha = 0.5)c = venn3_circles([set(RBP_Dict_b3e4['NC']['ENTREZGENE_gPro primary']), set(RBP_Dict_b3e3['I']['ENTREZGENE_gPro primary']), set(RBP_Dict_b4e3['I']['ENTREZGENE_gPro primary'])], linestyle='dashed')c[0].set_lw(1.5)c[0].set_ls('dotted')c[1].set_lw(1.5)c[1].set_ls('dotted')c[2].set_lw(1.5)c[2].set_ls('dotted')for text in v.set_labels: text.set_fontsize(18)for text in v.subset_labels: text.set_fontsize(14)xxxxxxxxxx16. Review Basic Gene Ontology
Functions
jwrangle.importMixedFiles( )
jweb.fetchQuickGO_stats( )
In this section we will explore Gene Ontology (GO) memberships for the observed proteins. There is little use in applying statistical tests such as Gene Ontology Enrichment Analysis (GOEA) for these experiments; the combination of selection by RNA interaction and the comparative lack of deep RBP validation for many candidates would make such a study rather spurious. We can, however, investigate the frequency with which our identified RBPs appear in previous studies. In addition, we can use this frequency to further assess whether conditions b3 and b4 are equivalent or not.
A number of GO-specific and utility functions are provided to help with retrieving Gene Ontologies from the QuickGo database. In this section we'll look at the most basic.
The function jweb.fetchQuickGO_stats( ) will fetch the annotation statistics for all records belonging to the gene ID from a submitted list. Reviewing these statistics before beginning an analysis is ideal, because it contextualises the breadth of future analyses. This function returns these statistics in the from of a dictionary which can be converted to a dataframe for easier viewing by using the dedicated function jweb.getQuickGO_stats( ).
In addition to contextualising the search space of subsequent analyses, fetching the annotation numbers is important for checking that the number of records, per GO ID, falls below 10000. This is because QuickGo will not allow larger searches to be done programmatically. If your GO ID of interest has many more records users should retrieve their records manually. These details will be covered fourther in the next section.
#### I like to keep a table of interesting GO terms in a local csv file, let's find itMyResources = jwrangle.importMixedFiles(base_path / 'my_resources')MyResources.keys()dict_keys(['.ipynb_checkpoints', 'control_elements', 'control_proteins', 'custom_dfs', 'FASTA', 'GOexplore.ipynb', 'GO_TermsOfInterest.csv', 'GO_TermsOfInterest_stats.csv', 'Hentze_core_RBP_list.txt', 'Hentze_core_RBP_list_uniprot.txt', 'Hentze_core_RBP_plusE1_8.txt', 'instrumentQC', 'limma_normalizeTIs_p3550_p3592_p3657.nb.html', 'limma_normalizeTIs_p3550_p3592_p3657.Rmd', 'TIs_p3550_p3592_p3657.csv', 'TIs_p3550_p3592_p3657_loess.csv'])
MyResources['GO_TermsOfInterest.csv'].head(9)| GO Term | relationship | depth | go ID | Description | ctrl list type | parent(s) | |
|---|---|---|---|---|---|---|---|
| 0 | binding | Molecular Function | 1 | GO:0005488 | NaN | NaN | GO:0003674 |
| 1 | protein binding | Molecular Function | 2 | GO:0005515 | NaN | NaN | GO:0005488 |
| 2 | chromatin binding | Molecular Function | 2 | GO:0003682 | NaN | NaN | GO:0005488 |
| 3 | nucleic acid binding | Molecular Function | 3 | GO:0003676 | NaN | NaN | GO:0003676 |
| 4 | DNA binding | Molecular Function | 4 | GO:0003677 | NaN | general | GO:0003676 |
| 5 | RNA binding | Molecular Function | 4 | GO:0003723 | NaN | general | GO:0003676 |
| 6 | DNA/RNA hybrid binding | Molecular Function | 4 | GO:0071667 | NaN | general | GO:0003676 |
| 7 | translation regulator activity | Molecular Function | 4 | GO:0090079 | NaN | NaN | GO:0003676 |
| 8 | regulatory region nucleic acid binding | Molecular Function | 4 | GO:0001067 | NaN | NaN | GO:0003676 |
#### Let's get basic statistics on the broader RNA-binding category "GO:0003723" MyGO_Stats_Dict = jweb.fetchQuickGO_stats(['GO:0003723'], QG_geneProductType = 'protein', QG_taxonId = '9606', QG_geneProductSubset = ['Swiss-Prot', 'TrEMBL'])MyGO_Stats_Dict.keys()dict_keys(['GO0003723_Swiss-Prot', 'GO0003723_TrEMBL'])
#### To view basic information, like the number of annotations associated with the GO termRNA_Binding_stats = jweb.getQuickGO_stats(MyGO_Stats_Dict)RNA_Binding_stats| go ID | Swiss-Prot_annotations | Swiss-Prot_genesProducts | TrEMBL_annotations | TrEMBL_genesProducts | |
|---|---|---|---|---|---|
| 0 | GO:0003723 | 5321 | 1689 | 5439 | 2807 |
xxxxxxxxxxResult: GOOD
The SwissProt and TrEMBL counts are below 10000 records so, conveniently, we can continue the tutorial with the simplest scenario for record extraction.
xxxxxxxxxx17. Retrieve GO Records
Functions
jweb.fetchQuickGO( )
jwrangle.concatGO_DataFrameDict( )
jweb.mapQuickGO( )
jwrangle.AnnotateDataFrameCtrls( )
jinspect.MQ_getFrequencyBySample( )
We can see above that none of our downloaded IDs have exceeded our fetch limit of 10000 records. If any did, we would need to manually retrieve the tsv file.
For the GO terms we wish to investigate to investigate jweb.fetchQuickGO( ) will fetch the relevant proteins and also apply the same onsistent gene ID mapping algorithm we used earlier when remapping the MaxQuant IDs. This process is important because it allows us to provide consistent gene IDs to subsequent sets analysis. Where the writetopath options is used, the results from any searches will be saved to a local downloads folder- this is useful as a snapshot of the annotations used at the time or as a simple way of avoiding fetching the records via API in the future.
The function jwrangle.concatGO_DataFrameDict( ) let's us concatenate a dictionary created by jweb.fetchQuickGO( ). This is useful when the user wishes to pool both SwissProt and TrEMBL records for a given
The function jwrangle.AnnotateDataFrameCtrls( ) taks our QuickGo records and annotates each protein or gene in a given dataframe (i.e. a proteinGroups table) for membership to the searched GO ID. The function jinspect.MQ_getFrequencyBySample( ) acts similarly to the peptide and gene count functions used earlier- it will sum the frequency of memberships to each GO catgeory and report the counts as a modified metadata table.
A Note on Manual Downloads
The jweb.fetchQuickGO( ) will also output to console the html address that one should use if a manual download if required; this address will encompass all the relevant record characteristics that typically used by this suite. Once this table has been downloaded, the function jweb.mapQuickGO( ) can be used to provide the same ID mapping service as performed in jweb.fetchQuickGO( ). The returned elements will also be manipulated into the same format.
#### Download and create a local copy of all the records associated with our GO query. QuickGo_dict = jweb.fetchQuickGO(['GO:0003723'], QG_geneProductType = 'protein', QG_taxonId = '9606', QG_geneProductSubset = ['Swiss-Prot', 'TrEMBL'], gConvertOrganism='hsapiens', writetopath= True)QuickGo_dict.keys()GO0003723_Swiss-Prot has 5321 records: https://www.ebi.ac.uk/QuickGO/annotations?goId=GO:0003723&taxonId=9606&taxonUsage=exact&goUsage=descendants&geneProductType=protein&geneProductSubset=Swiss-Prot GO0003723_TrEMBL has 5439 records: https://www.ebi.ac.uk/QuickGO/annotations?goId=GO:0003723&taxonId=9606&taxonUsage=exact&goUsage=descendants&geneProductType=protein&geneProductSubset=TrEMBL
dict_keys(['GO0003723_Swiss-Prot', 'GO0003723_TrEMBL'])
### We'll assess our RBP frequency by considering both SwissProt and TrEmbl records, thus let's concatenate the dictionaries for GO:0003723QuickGo_dict_concat = jwrangle.concatGO_DataFrameDict(QuickGo_dict)QuickGo_dict_concat.keys()dict_keys(['GO0003723'])
#### With our GO records in hand we next investigate whether our identified proteins are members of GO:OOO3723#### This can be done with any DataFrame, but here we'll use the last version of our modified proteinGroups table pGroup_GO = jwrangle.AnnotateDataFrameCtrls(pGroup_Freq, QuickGo_dict_concat, search_match = 'ENTREZGENE_gPro primary', dict_match = 'ENTREZGENE_gPro primary', none_col = 'GO_None')pGroup_GO.keys()D:\MEGA\Programming\Scripts_JS\RBP_SUITE\modules\jwrangle.py:40: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy ctrl_nohits_df[none_col] = True #add zero annotation column to annotation columns
dict_keys(['ann_df', 'ann_subdf'])
#### The ann_df contains the master dataframe. The sub_df contains a slice for each positively identified Gene in each GO category searched.#### Let's view what those membership annotations look likepGroup_GO['ann_df'][['ENTREZGENE_gPro primary']+[i for i in list(pGroup_GO['ann_df'].columns) if 'GO' in i]].head(2)| ENTREZGENE_gPro primary | GO0003723 | GO_None | |
|---|---|---|---|
| 0 | HDLBP | True | False |
| 1 | RPS9 | True | False |
#### Now to find out the counts for each GO ID being searched we will modify our metadata table in a fashion in the same way as for prptide and gene counting.#### We'll use iBAQ for these counts but, given all intensities have previously been filtered by LFQ membership both 'Intensity' or 'LFQ intensity' would give the same result.metaStatsGo = jinspect.MQ_getFrequencyBySample(pGroup_GO['ann_df'], metaStats, freqList = list(QuickGo_dict_concat.keys()) + ['GO_None'], measure = 'iBAQ')metaStatsGo.iloc[:,1:]| condition | replicate | sample | measure | MQgroups | Peptides | Intensity | iBAQ | LFQ intensity | GO0003723 | GO_None | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b3e3_nCL | A | b3e3_nCL-A | Intensity | b3e3 | 180.0 | 31 | 31 | 31 | 23 | 8 |
| 1 | b3e3_nCL | B | b3e3_nCL-B | Intensity | b3e3 | 156.0 | 32 | 32 | 32 | 24 | 8 |
| 2 | b3e3_nCL | C | b3e3_nCL-C | Intensity | b3e3 | 155.0 | 50 | 50 | 50 | 36 | 14 |
| 3 | b3e3_nCL | D | b3e3_nCL-D | Intensity | b3e3 | 125.0 | 51 | 51 | 51 | 32 | 19 |
| 4 | b3e3_254 | A | b3e3_254-A | Intensity | b3e3 | 13799.0 | 1255 | 1255 | 1255 | 884 | 371 |
| 5 | b3e3_254 | B | b3e3_254-B | Intensity | b3e3 | 13735.0 | 1253 | 1253 | 1253 | 890 | 363 |
| 6 | b3e3_254 | C | b3e3_254-C | Intensity | b3e3 | 13854.0 | 1285 | 1285 | 1285 | 897 | 388 |
| 7 | b3e3_254 | D | b3e3_254-D | Intensity | b3e3 | 13928.0 | 1528 | 1528 | 1528 | 970 | 558 |
| 8 | b3e4_nCL | A | b3e4_nCL-A | Intensity | b3e4 | 9859.0 | 697 | 697 | 697 | 312 | 385 |
| 9 | b3e4_nCL | B | b3e4_nCL-B | Intensity | b3e4 | 11362.0 | 751 | 751 | 751 | 318 | 433 |
| 10 | b3e4_nCL | C | b3e4_nCL-C | Intensity | b3e4 | 11206.0 | 758 | 758 | 758 | 337 | 421 |
| 11 | b3e4_nCL | D | b3e4_nCL-D | Intensity | b3e4 | 11832.0 | 961 | 961 | 961 | 381 | 580 |
| 12 | b3e4_254 | A | b3e4_254-A | Intensity | b3e4 | 15220.0 | 1177 | 1177 | 1177 | 760 | 417 |
| 13 | b3e4_254 | B | b3e4_254-B | Intensity | b3e4 | 15259.0 | 1193 | 1193 | 1193 | 770 | 423 |
| 14 | b3e4_254 | C | b3e4_254-C | Intensity | b3e4 | 15127.0 | 1209 | 1209 | 1209 | 777 | 432 |
| 15 | b3e4_254 | D | b3e4_254-D | Intensity | b3e4 | 15312.0 | 1471 | 1471 | 1471 | 856 | 615 |
| 16 | b4e3_nCL | A | b4e3_nCL-A | Intensity | b4e3 | 261.0 | 54 | 54 | 54 | 39 | 15 |
| 17 | b4e3_nCL | B | b4e3_nCL-B | Intensity | b4e3 | 263.0 | 58 | 58 | 58 | 43 | 15 |
| 18 | b4e3_nCL | C | b4e3_nCL-C | Intensity | b4e3 | 275.0 | 64 | 64 | 64 | 48 | 16 |
| 19 | b4e3_nCL | D | b4e3_nCL-D | Intensity | b4e3 | 263.0 | 88 | 88 | 88 | 56 | 32 |
| 20 | b4e3_254 | A | b4e3_254-A | Intensity | b4e3 | 14110.0 | 1303 | 1303 | 1303 | 901 | 402 |
| 21 | b4e3_254 | B | b4e3_254-B | Intensity | b4e3 | 13672.0 | 1259 | 1259 | 1259 | 887 | 372 |
| 22 | b4e3_254 | C | b4e3_254-C | Intensity | b4e3 | 13632.0 | 1286 | 1286 | 1286 | 901 | 385 |
| 23 | b4e3_254 | D | b4e3_254-D | Intensity | b4e3 | 13798.0 | 1533 | 1533 | 1533 | 974 | 559 |
| 24 | b4e4_nCL | A | b4e4_nCL-A | Intensity | b4e4 | 13312.0 | 834 | 834 | 834 | 341 | 493 |
| 25 | b4e4_nCL | B | b4e4_nCL-B | Intensity | b4e4 | 13628.0 | 858 | 858 | 858 | 347 | 511 |
| 26 | b4e4_nCL | C | b4e4_nCL-C | Intensity | b4e4 | 13209.0 | 847 | 847 | 847 | 348 | 499 |
| 27 | b4e4_nCL | D | b4e4_nCL-D | Intensity | b4e4 | 13925.0 | 1072 | 1072 | 1072 | 413 | 659 |
| 28 | b4e4_254 | B | b4e4_254-B | Intensity | b4e4 | 11277.0 | 995 | 995 | 995 | 672 | 323 |
| 29 | b4e4_254 | C | b4e4_254-C | Intensity | b4e4 | 16097.0 | 1220 | 1220 | 1220 | 790 | 430 |
| 30 | b4e4_254 | D | b4e4_254-D | Intensity | b4e4 | 16280.0 | 1522 | 1522 | 1522 | 902 | 620 |
| 31 | b4e4_254 | F | b4e4_254-F | Intensity | b4e4 | 16250.0 | 1219 | 1219 | 1219 | 786 | 433 |
#### To plot these countsjvis.BarPlotByGroup_sbplot(metaStatsGo, x_col = 'condition', y_col = 'GO0003723', title = 'GO:0003723, RNA-Binding', pal = set2_paired)<Figure size 432x288 with 0 Axes>
#### Calculate the % GO0003723 members per groupmetaStatsGo['% RBP'] = metaStatsGo.apply(lambda row: round(100*row['GO0003723']/(row['GO0003723']+row['GO_None'])), axis = 1)#### Plotjvis.BarPlotByGroup_sbplot(metaStatsGo, x_col = 'condition', y_col = '% RBP', title = 'GO:0003723, RNA-Binding', pal = set2_paired, yrange = [0,100])<Figure size 432x288 with 0 Axes>